[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217979#comment-16217979 ]

Peng Meng commented on SPARK-22277:
-----------------------------------

For problems 1 and 2, could you please post the test code? For problem 1, one possible cause is that every feature has the same chi-square statistic; in that case the result is correct no matter which feature is selected. The code would help in analyzing the problem.

> Chi Square selector garbling Vector content.
> --------------------------------------------
>
>                 Key: SPARK-22277
>                 URL: https://issues.apache.org/jira/browse/SPARK-22277
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
>
> There is a difference in behavior between using the chi-square selector and using the features directly in a decision tree classifier.
> In the code below, I have configured the chi-square selector as a pass-through, but the decision tree classifier is unable to process its output. It is, however, able to process the same features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation. Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> import sys
>
> df = spark.createDataFrame([
>     (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
>     (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
>     (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)],
>     ["id", "features", "clicked"])
>
> # The ChiSq selector will just be a pass-through: all four features in
> # the input will be in the output as well.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>                          outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected"
>       % selector.getNumTopFeatures())
>
> try:
>     # Fails
>     dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
>     model = dt.fit(result)
> except:
>     print(sys.exc_info())
>
> # Works
> dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
> model = dt.fit(df)
>
> # Make predictions on the same dataset, not splitting!
> predictions = model.transform(result)
>
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
>
> # Select (prediction, true label) and compute the test error.
> evaluator = MulticlassClassificationEvaluator(
>     labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but it does not have the number of values specified.',
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)
>   at org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
>   at org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)
>   at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)
>   at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:72)
>   at sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)'), )
> +----------+-------+
> |prediction|clicked|
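As background for the comment above: the score ChiSqSelector ranks features by is the ordinary Pearson chi-square statistic of each feature against the label, so if all features tie, any selection order is "correct". A minimal pure-Python sketch of that statistic (illustrative only, not Spark's implementation), applied to the toy dataset from this report:

```python
from collections import Counter

def chi_square_stat(feature_values, labels):
    """Pearson chi-square statistic of one (categorically treated) feature
    against the label, from the observed/expected contingency counts."""
    n = len(labels)
    obs = Counter(zip(feature_values, labels))
    f_counts = Counter(feature_values)
    l_counts = Counter(labels)
    stat = 0.0
    for f in f_counts:
        for l in l_counts:
            expected = f_counts[f] * l_counts[l] / n
            observed = obs.get((f, l), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

# The three rows from the bug report: (features, label).
rows = [([0.0, 0.0, 18.0, 1.0], 1.0),
        ([0.0, 1.0, 12.0, 0.0], 0.0),
        ([1.0, 0.0, 15.0, 0.1], 0.0)]
labels = [y for _, y in rows]
stats = [chi_square_stat([x[i] for x, _ in rows], labels) for i in range(4)]
# stats[0] == 0.75 for this toy data
```

Note that features 0 and 1 produce identical statistics here, which is exactly the tie situation the comment describes.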
[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame
[ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209166#comment-16209166 ]

Peng Meng commented on SPARK-22295:
-----------------------------------

Why is there a leading space in the column names: ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age'?

> Chi Square selector not recognizing field in Data frame
> -------------------------------------------------------
>
>                 Key: SPARK-22295
>                 URL: https://issues.apache.org/jira/browse/SPARK-22295
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
>
> The ChiSquare selector does not recognize the field 'class', which is present in the data frame, while fitting the model. I am using the PIMA Indians diabetes dataset from UCI. The complete code and output are below for reference. However, when a few rows of the input file are used to create a dataframe manually, it works. I couldn't understand the pattern here. Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
>
> file_name = 'data/pima-indians-diabetes.data'
> df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load(file_name).cache()
> df.show(1)
>
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>                                        ' test', ' mass', ' pedi', ' age'],
>                             outputCol="features")
> df = assembler.transform(df)
> df.show(1)
>
> try:
>     css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                         outputCol="selected", labelCol='class').fit(df)
> except:
>     print(sys.exc_info())
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> only showing top 1 row
>
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|            features|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> only showing top 1 row
>
> (, IllegalArgumentException('Field "class" does not exist.',
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
>   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
>   at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
>   at org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>   at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)'), )
>
> *The below code works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
>
> file_name = 'data/pima-indians-diabetes.data'
> # df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame.
> # This will work, but the frame picked up from the file will not.
> df = spark.createDataFrame([
>     [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
>     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
>     [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', "class"])
> df.show(1)
>
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>                                        ' test', ' mass', ' pedi', ' age'],
>                             outputCol="features")
> df = assembler.transform(df)
> df.show(1)
>
> try:
>     css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                         outputCol="selecte
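The leading spaces in the comment above come from the CSV header row itself, so after reading the file the label column is literally named " class", not "class". A small sketch of the failure mode and a one-pass normalization; the `df.toDF(...)` line in the comment is how the same idea would look on a Spark DataFrame:

```python
# Header row as it appears in the CSV file: every name after the first
# carries a leading space, so the label column is " class", not "class".
header = "preg, plas, pres, skin, test, mass, pedi, age, class"
raw_cols = header.split(",")

# Normalize the names once, right after reading; on a Spark DataFrame the
# equivalent would be: df = df.toDF(*[c.strip() for c in df.columns])
cols = [c.strip() for c in raw_cols]
```

With stripped names, `labelCol="class"` resolves normally and no space-prefixed names need to be remembered in `inputCols`.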
[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame
[ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209164#comment-16209164 ]

Peng Meng commented on SPARK-22295:
-----------------------------------

Please use labelCol=" class", not labelCol="class".

> Chi Square selector not recognizing field in Data frame
> -------------------------------------------------------
>
>                 Key: SPARK-22295
>                 URL: https://issues.apache.org/jira/browse/SPARK-22295
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame
[ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208708#comment-16208708 ]

Peng Meng commented on SPARK-22295:
-----------------------------------

Hi [~cheburakshu], thanks for reporting this bug and for the helpful code. This is caused by a problem similar to SPARK-22277, but it is not the same issue. The root cause is that when a dataframe is transformed, the field/attribute metadata is not set correctly. There may be other similar bugs in the code; we can fix them separately or all together. [~yanboliang] [~mlnick] [~srowen]

> Chi Square selector not recognizing field in Data frame
> -------------------------------------------------------
>
>                 Key: SPARK-22295
>                 URL: https://issues.apache.org/jira/browse/SPARK-22295
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207146#comment-16207146 ]

Peng Meng commented on SPARK-22277:
-----------------------------------

This seems to be a bug. If no one is working on it, I can take it.

> Chi Square selector garbling Vector content.
> --------------------------------------------
>
>                 Key: SPARK-22277
>                 URL: https://issues.apache.org/jira/browse/SPARK-22277
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
>
> +----------+-------+------------------+
> |prediction|clicked|          features|
> +----------+-------+------------------+
> |       1.0|    1.0|[0.0,0.0,18.0,1.0]|
> |       0.0|    0.0|[0.0,1.0,12.0,0.0]|
> |       0.0|    0.0|[1.0,0.0,15.0,0.1]|
> +----------+-------+------------------+
[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector
[ https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196412#comment-16196412 ]

Peng Meng commented on SPARK-22115:
-----------------------------------

OK, thanks.

> Add operator for linalg Matrix and Vector
> -----------------------------------------
>
>                 Key: SPARK-22115
>                 URL: https://issues.apache.org/jira/browse/SPARK-22115
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.0.0
>            Reporter: Peng Meng
>
> For example, there is a lot of code in LDA like this:
> {code:java}
> phiNorm := expElogbetad * expElogthetad +:+ 1e-100
> {code}
> expElogbetad is a breeze Matrix and expElogthetad is a breeze Vector. This code calls BLAS GEMV and then loops over the result to add 1e-100.
> Actually, this can be done with GEMV alone, because the standard gemv interface is:
> gemv(alpha, A, x, beta, y)  // y := alpha*A*x + beta*y
> We could provide operators (e.g. element-wise product (:*) and element-wise sum (:+)) on Spark linalg Matrix and Vector, and replace breeze Matrix and Vector with Spark linalg Matrix and Vector.
> Then every case of the form y = alpha*A*x + beta*y can be handled by a single GEMM or GEMV call, with no need to call GEMM or GEMV and then loop over the result for the addition, as the current implementation does.
> I can help to do this if we plan to add the feature.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
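The fusion proposed in the issue above can be sketched in plain Python (illustrative only, not Spark or breeze code): because gemv computes y := alpha*A*x + beta*y, pre-filling y with the constant 1e-100 and passing beta = 1 absorbs the element-wise addition into the single BLAS call, instead of running GEMV and then a second loop over the result.

```python
def gemv(alpha, A, x, beta, y):
    """y := alpha*A*x + beta*y, the standard BLAS gemv contract (pure Python)."""
    return [alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
            for row, y_i in zip(A, y)]

A = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, 0.25]

# Two-step version, like the current LDA code: GEMV, then a second pass
# over the result to add the constant.
two_step = [v + 1e-100 for v in gemv(1.0, A, x, 0.0, [0.0, 0.0])]

# Fused version: pre-fill y with the constant and let beta = 1 add it in.
fused = gemv(1.0, A, x, 1.0, [1e-100, 1e-100])

assert two_step == fused
```

The same trick generalizes to GEMM for the matrix-matrix cases mentioned in the issue.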
[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector
[ https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196115#comment-16196115 ]

Peng Meng commented on SPARK-22115:
-----------------------------------

Hi [~mlnick], I am just back from a national holiday. If this is only for LDA, it is OK to make it private. But I think it would be better to expose a new public API; we could then extend that API and eventually replace breeze Matrix and Vector.

> Add operator for linalg Matrix and Vector
> -----------------------------------------
>
>                 Key: SPARK-22115
>                 URL: https://issues.apache.org/jira/browse/SPARK-22115
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.0.0
>            Reporter: Peng Meng
[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector
[ https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180292#comment-16180292 ]

Peng Meng commented on SPARK-22115:
-----------------------------------

Thanks [~srowen], I will begin working on it.

> Add operator for linalg Matrix and Vector
> -----------------------------------------
>
>                 Key: SPARK-22115
>                 URL: https://issues.apache.org/jira/browse/SPARK-22115
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.0.0
>            Reporter: Peng Meng
[jira] [Created] (SPARK-22115) Add operator for linalg Matrix and Vector
Peng Meng created SPARK-22115:
---------------------------------

             Summary: Add operator for linalg Matrix and Vector
                 Key: SPARK-22115
                 URL: https://issues.apache.org/jira/browse/SPARK-22115
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 3.0.0
            Reporter: Peng Meng
[jira] [Created] (SPARK-22114) The condition of OnlineLDAOptimizer convergence should be configurable
Peng Meng created SPARK-22114:
---------------------------------

             Summary: The condition of OnlineLDAOptimizer convergence should be configurable
                 Key: SPARK-22114
                 URL: https://issues.apache.org/jira/browse/SPARK-22114
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 2.3.0
            Reporter: Peng Meng
            Priority: Minor

The current convergence condition of OnlineLDAOptimizer is:
{code:java}
while (meanGammaChange > 1e-3)
{code}
This threshold is critical for both the performance and the accuracy of LDA, so it should be configurable, as it is in Vowpal Wabbit:
https://github.com/JohnLangford/vowpal_wabbit/blob/430f69453bc4876a39351fba1f18771bdbdb7122/vowpalwabbit/lda_core.cc line 638
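The shape of the requested change can be sketched in plain Python (illustrative only; the names `optimize`, `step`, and `gamma` are hypothetical stand-ins, not Spark's OnlineLDAOptimizer code): the hard-coded 1e-3 becomes a `tol` parameter on the loop that checks the mean change in gamma.

```python
def optimize(step, gamma, tol=1e-3, max_iter=100):
    """Iterate `step` until the mean absolute change in gamma drops to `tol`.
    `tol` replaces the hard-coded 1e-3 threshold; a looser tolerance trades
    accuracy for fewer iterations."""
    for _ in range(max_iter):
        new_gamma = step(gamma)
        mean_change = sum(abs(a - b) for a, b in zip(new_gamma, gamma)) / len(gamma)
        gamma = new_gamma
        if mean_change <= tol:
            break
    return gamma

# Toy fixed-point step that halves the distance to [1, 1] each iteration.
halve = lambda g: [(gi + 1.0) / 2.0 for gi in g]
loose = optimize(halve, [0.0, 2.0], tol=1e-1)
tight = optimize(halve, [0.0, 2.0], tol=1e-6)
```

Here `loose` stops several iterations earlier and further from the fixed point than `tight`, which is exactly the performance/accuracy trade-off the issue describes.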
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155018#comment-16155018 ]

Peng Meng commented on SPARK-21476:
-----------------------------------

With 50 instances per partition, broadcasting is a little faster than the current solution; with 500 instances per partition, it is a little slower. In both cases the difference is about 10%.

> RandomForest classification model not using broadcast in transform
> -------------------------------------------------------------------
>
>                 Key: SPARK-21476
>                 URL: https://issues.apache.org/jira/browse/SPARK-21476
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Saurabh Agrawal
>            Priority: Minor
>
> I notice significant task deserialization latency while running prediction with pipelines that use RandomForestClassificationModel. While digging into the source, I found that the transform method of RandomForestClassificationModel binds to its parent, ProbabilisticClassificationModel, and that the only concrete definition RandomForestClassificationModel provides, and which is actually used in transform, is predictRaw. Broadcasting is not being used in predictRaw.
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154849#comment-16154849 ]

Peng Meng commented on SPARK-21476:
-----------------------------------

I think the performance difference between my test and yours comes from the number of records in each partition. In my test I use model.transform(data) with many instances (e.g. 10k) in each partition, so the model's deserialization time is not a large part of the end-to-end time. In your case there may be very few instances in each partition, which makes the deserialization time look very long. In my case, with broadcast, the end-to-end time is longer than with the current solution. We can share our test configurations and do more testing.

> RandomForest classification model not using broadcast in transform
> -------------------------------------------------------------------
>
>                 Key: SPARK-21476
>                 URL: https://issues.apache.org/jira/browse/SPARK-21476
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Saurabh Agrawal
>            Priority: Minor
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136538#comment-16136538 ] Peng Meng commented on SPARK-21476: --- I don't think it can be argued that "in any case using broadcasting causes no harm". Actually, in this case, my tests show that with broadcasting the performance is slightly worse than the current solution.
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121284#comment-16121284 ] Peng Meng commented on SPARK-21688: --- MKL is just one example of a native BLAS; if the user has OpenBLAS, ATLAS, and so on, those also work. > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 > Reporter: Vincent > Attachments: ddot unitest.png, mllib svm training.png, > native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png > > > In the current mllib SVM implementation, we found that the CPU is not fully > utilized; one reason is that f2j BLAS is hard-coded in the HingeGradient > computation. As we found earlier > (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings > native BLAS is generally better than f2j at the unit-test level. Here we make > the BLAS operations in SVM go through MKL BLAS and get an end-to-end > performance report showing that in most cases native BLAS outperforms f2j > BLAS by up to 50%. > So we suggest removing the hard-coded f2j calls and going with native BLAS if > available. If this proposal is acceptable, we will move on to benchmark the > other impacted algorithms.
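To make the hotspot concrete, here is a minimal, self-contained sketch of a hinge-loss gradient step. It is not Spark's actual HingeGradient class; the names are illustrative. The inner dot product is the operation that gets dispatched to either f2j or a native BLAS (MKL, OpenBLAS, ATLAS) in the proposal above.

```scala
// Hedged sketch of a hinge-loss gradient step. The dot product below stands
// in for the BLAS call (f2j vs native) that this JIRA is about; in Spark it
// would go through the linalg BLAS wrapper rather than a hand-written loop.
object HingeSketch {
  // Plain dot product standing in for BLAS.dot.
  def dot(x: Array[Double], y: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < x.length) { s += x(i) * y(i); i += 1 }
    s
  }

  // Hinge gradient for one example with label in {0, 1}:
  // the label is rescaled to {-1, +1}; if the margin is below 1,
  // the (sub)gradient is -labelScaled * x, otherwise zero.
  def hingeGradient(x: Array[Double], label: Double, w: Array[Double]): Array[Double] = {
    val labelScaled = 2 * label - 1.0
    if (1.0 > labelScaled * dot(x, w)) x.map(_ * -labelScaled)
    else Array.fill(x.length)(0.0)
  }
}
```

For long, dense feature vectors, essentially all the time in this step is spent inside dot, which is why swapping the BLAS backend moves end-to-end training time.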
[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization
[ https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121175#comment-16121175 ] Peng Meng commented on SPARK-21680: --- I mean that if the user calls toSparse(size) with a size smaller than numNonzeros, there may be a problem.
> ML/MLLIB Vector compressed optimization
> ---
> Key: SPARK-21680
> URL: https://issues.apache.org/jira/browse/SPARK-21680
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.3.0
> Reporter: Peng Meng
>
> When using Vector.compressed to convert a Vector to a SparseVector, the
> performance is much lower than Vector.toSparse.
> This is because the values must be scanned three times with
> Vector.compressed, but only twice with Vector.toSparse.
> When the vector is long, there is a significant performance difference
> between the two methods.
> Code of Vector.compressed:
> {code:java}
> def compressed: Vector = {
>   val nnz = numNonzeros
>   // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
>   if (1.5 * (nnz + 1.0) < size) {
>     toSparse
>   } else {
>     toDense
>   }
> }
> {code}
> I propose to change it to:
> {code:java}
> def compressed: Vector = {
>   val nnz = numNonzeros
>   // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
>   if (1.5 * (nnz + 1.0) < size) {
>     val ii = new Array[Int](nnz)
>     val vv = new Array[Double](nnz)
>     var k = 0
>     foreachActive { (i, v) =>
>       if (v != 0) {
>         ii(k) = i
>         vv(k) = v
>         k += 1
>       }
>     }
>     new SparseVector(size, ii, vv)
>   } else {
>     toDense
>   }
> }
> {code}
[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization
[ https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120947#comment-16120947 ] Peng Meng commented on SPARK-21680: --- Hi [~srowen], if we add toSparse(size), then for safety it is better to check size against numNonzeros: if the given size is smaller than numNonzeros, the program may crash. But checking size against numNonzeros adds one more scan over the values. So in this PR I revised the code as described in this JIRA. Thanks.
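The safety check under discussion can be sketched on a plain array (a hypothetical toSparse(size)-style helper, not the actual Vector API): validating the requested size against the non-zero count requires counting the non-zeros first, which is exactly the extra scan the comment above refers to.

```scala
// Hedged sketch: a toSparse(size)-style conversion on a plain array.
// The size-vs-nnz validation costs one extra scan over the values.
object SparseSketch {
  // Returns (logical size, indices, values); rejects a size that cannot
  // hold all non-zero entries instead of crashing later.
  def toSparse(values: Array[Double], size: Int): (Int, Array[Int], Array[Double]) = {
    val nnz = values.count(_ != 0.0)                // extra scan, for validation
    require(size >= nnz, s"size $size < numNonzeros $nnz")
    val ii = new Array[Int](nnz)
    val vv = new Array[Double](nnz)
    var k = 0
    var i = 0
    while (i < values.length) {                     // second scan fills the arrays
      if (values(i) != 0.0) { ii(k) = i; vv(k) = values(i); k += 1 }
      i += 1
    }
    (size, ii, vv)
  }
}
```

Without the require, a caller passing size < nnz would get a sparse vector whose declared size is inconsistent with its index array, which is the crash risk mentioned above.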
[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization
[ https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120136#comment-16120136 ] Peng Meng commented on SPARK-21680: --- Ok, thanks, I will submit a PR.
[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization
[ https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120115#comment-16120115 ] Peng Meng commented on SPARK-21680: --- Then we would have two toSparse methods: toSparse and toSparse(size). Is that what you mean? Thanks.
[jira] [Commented] (SPARK-21624) Optimize communication cost of RF/GBT/DT
[ https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120105#comment-16120105 ] Peng Meng commented on SPARK-21624: --- Hi [~mlnick], what do you think about this: https://issues.apache.org/jira/browse/SPARK-21680 Thanks.
> Optimize communication cost of RF/GBT/DT
> ----
> Key: SPARK-21624
> URL: https://issues.apache.org/jira/browse/SPARK-21624
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.3.0
> Reporter: Peng Meng
>
> {quote}The implementation of RF is bound by either the cost of statistics
> computation on workers or by communicating the sufficient statistics.{quote}
> The statistics are stored in allStats:
> {code:java}
> /**
>  * Flat array of elements.
>  * Index for start of stats for a (feature, bin) is:
>  * index = featureOffsets(featureIndex) + binIndex * statsSize
>  */
> private var allStats: Array[Double] = new Array[Double](allStatsSize)
> {code}
> The size of allStats may be very large, and it can be very sparse, especially
> for nodes near the leaves of the tree.
> I have changed allStats from an Array to a SparseVector; my tests show the
> communication volume is down by about 50%.
[jira] [Created] (SPARK-21680) ML/MLLIB Vector compressed optimization
Peng Meng created SPARK-21680: - Summary: ML/MLLIB Vector compressed optimization Key: SPARK-21680 URL: https://issues.apache.org/jira/browse/SPARK-21680 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.3.0 Reporter: Peng Meng
When using Vector.compressed to convert a Vector to a SparseVector, the performance is much lower than Vector.toSparse. This is because the values must be scanned three times with Vector.compressed, but only twice with Vector.toSparse. When the vector is long, there is a significant performance difference between the two methods. Code of Vector.compressed:
{code:java}
def compressed: Vector = {
  val nnz = numNonzeros
  // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
  if (1.5 * (nnz + 1.0) < size) {
    toSparse
  } else {
    toDense
  }
}
{code}
I propose to change it to:
{code:java}
def compressed: Vector = {
  val nnz = numNonzeros
  // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
  if (1.5 * (nnz + 1.0) < size) {
    val ii = new Array[Int](nnz)
    val vv = new Array[Double](nnz)
    var k = 0
    foreachActive { (i, v) =>
      if (v != 0) {
        ii(k) = i
        vv(k) = v
        k += 1
      }
    }
    new SparseVector(size, ii, vv)
  } else {
    toDense
  }
}
{code}
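The proposal above can be sketched on a plain array (illustrative only, not the actual Vector implementation): one pass to count non-zeros for the density heuristic, then a single second pass that fills the sparse arrays directly, instead of having toSparse re-count internally for a third scan.

```scala
// Hedged sketch of the proposed two-scan "compressed" logic on a plain
// Array[Double]. Left = dense copy, Right = (size, indices, values).
object CompressedSketch {
  // Heuristic from the JIRA: dense ~ 8 * size + 8 bytes,
  // sparse ~ 12 * nnz + 20 bytes, so prefer sparse when 1.5*(nnz+1) < size.
  def compressed(values: Array[Double]): Either[Array[Double], (Int, Array[Int], Array[Double])] = {
    val size = values.length
    val nnz = values.count(_ != 0.0)                 // first scan
    if (1.5 * (nnz + 1.0) < size) {
      val ii = new Array[Int](nnz)
      val vv = new Array[Double](nnz)
      var k = 0
      var i = 0
      while (i < size) {                             // second (and last) scan
        if (values(i) != 0.0) { ii(k) = i; vv(k) = values(i); k += 1 }
        i += 1
      }
      Right((size, ii, vv))
    } else {
      Left(values.clone())
    }
  }
}
```

The key point is that the sparse branch never re-counts: nnz from the first scan sizes the arrays exactly, so only two scans total are needed, matching toSparse's cost.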
[jira] [Updated] (SPARK-21638) Warning message of RF is not accurate
[ https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-21638: -- Description: When training an RF model, there are many warning messages like this: {quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.{quote} This warning message is unnecessary and its numbers are not accurate. Actually, whenever not all queued nodes can be split in one iteration, this warning is shown; for most jobs that is the case in every iteration, so the warning appears on every iteration. This is because:
{code:java}
while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) {
    Some(SamplingUtils.reservoirSampleAndCount(Range(0,
      metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1)
  } else {
    None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
    nodeStack.pop()
    mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node
    mutableTreeToNodeToIndexInfo
      .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id)
      = new NodeIndexInfo(numNodesInGroup, featureSubset)
  }
  numNodesInGroup += 1
  // We do not add the node to mutableNodesForGroup here, but we still add its memUsage.
  memUsage += nodeMemUsage
}
if (memUsage > maxMemoryUsage) {
  // If maxMemoryUsage is 0, we should still allow splitting 1 node.
  logWarning(s"Tree learning is using approximately $memUsage bytes per iteration, which" +
    s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This allows splitting" +
    s" $numNodesInGroup nodes in this iteration.")
}
{code}
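The fix under discussion can be sketched independently of Spark (a hypothetical helper using plain Long costs instead of LearningNodes): stop packing at the first node that does not fit, so the accumulated memory usage never overshoots the budget and the warning becomes unnecessary.

```scala
// Hedged sketch of the corrected node-packing loop: greedily take node
// memory costs until the next one would exceed the budget, instead of
// counting a node's memory after deciding not to take it.
object PackSketch {
  // Returns (nodes taken, total bytes used). A first node larger than the
  // budget is still taken, mirroring "still allow splitting 1 node".
  def packNodes(costs: List[Long], maxMemoryUsage: Long): (Int, Long) = {
    var memUsage = 0L
    var taken = 0
    var rest = costs
    while (rest.nonEmpty && (memUsage + rest.head <= maxMemoryUsage || memUsage == 0)) {
      memUsage += rest.head
      taken += 1
      rest = rest.tail
    }
    (taken, memUsage)
  }
}
```

With this shape, memUsage and the node count always describe the nodes actually scheduled, so any log message derived from them reports accurate numbers.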
[jira] [Updated] (SPARK-21624) Optimize communication cost of RF/GBT/DT
[ https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-21624: -- Description: {quote}The implementation of RF is bound by either the cost of statistics computation on workers or by communicating the sufficient statistics.{quote} The statistics are stored in allStats:
{code:java}
/**
 * Flat array of elements.
 * Index for start of stats for a (feature, bin) is:
 * index = featureOffsets(featureIndex) + binIndex * statsSize
 */
private var allStats: Array[Double] = new Array[Double](allStatsSize)
{code}
The size of allStats may be very large, and it can be very sparse, especially for nodes near the leaves of the tree. I have changed allStats from an Array to a SparseVector; my tests show the communication volume is down by about 50%.
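A hedged sketch of the allStats change (plain arrays, not the actual DTStatsAggregator): keep only the non-zero (index, value) pairs and compare rough serialized sizes. The per-element byte costs below are illustrative assumptions, not Spark's exact wire format.

```scala
// Hedged sketch: sparse representation of a mostly-zero stats array.
// Assumed rough costs: dense ~ 8 bytes per Double; sparse ~ 4 bytes for the
// Int index plus 8 bytes for the Double value per non-zero entry.
object AllStatsSketch {
  def toSparse(allStats: Array[Double]): (Array[Int], Array[Double]) = {
    val idx = allStats.indices.filter(allStats(_) != 0.0).toArray
    (idx, idx.map(allStats(_)))
  }

  def denseBytes(size: Int): Long = 8L * size
  def sparseBytes(nnz: Int): Long = 12L * nnz
}
```

For tree nodes near the leaves, where most (feature, bin) cells never receive any instances, nnz is far below size, so the sparse form shrinks what workers must ship to the driver, consistent with the ~50% reduction reported above.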
[jira] [Commented] (SPARK-21638) Warning message of RF is not accurate
[ https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114156#comment-16114156 ] Peng Meng commented on SPARK-21638: --- The first number in the example warning message (the reported memUsage) should have nodeMemUsage subtracted.
> Warning message of RF is not accurate
> -
> Key: SPARK-21638
> URL: https://issues.apache.org/jira/browse/SPARK-21638
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Peng Meng
> Priority: Minor
>
> When training an RF model, there are many warning messages like this:
> {quote}WARN RandomForest: Tree learning is using approximately 268492800
> bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456.
> This allows splitting 2622 nodes in this iteration.{quote}
> This warning message is unnecessary and its numbers are not accurate.
> This is because:
> {code:java}
> while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) {
>     Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>       metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1)
>   } else {
>     None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
>     nodeStack.pop()
>     mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node
>     mutableTreeToNodeToIndexInfo
>       .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id)
>       = new NodeIndexInfo(numNodesInGroup, featureSubset)
>   }
>   numNodesInGroup += 1
>   // We do not add the node to mutableNodesForGroup here, but we still add its memUsage.
>   memUsage += nodeMemUsage
> }
> if (memUsage > maxMemoryUsage) {
>   // If maxMemoryUsage is 0, we should still allow splitting 1 node.
>   logWarning(s"Tree learning is using approximately $memUsage bytes per iteration, which" +
>     s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This allows splitting" +
>     s" $numNodesInGroup nodes in this iteration.")
> }
> {code}
> To avoid this unnecessary warning, we should change the code like this:
> {code:java}
> while (nodeStack.nonEmpty) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) {
>     Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>       metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1)
>   } else {
>     None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
>     nodeStack.pop()
>     mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node
>     mutableTreeToNodeToIndexInfo
>       .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id)
>       = new NodeIndexInfo(numNodesInGroup, featureSubset)
>     numNodesInGroup += 1
>     memUsage += nodeMemUsage
>   } else {
>     break
>   }
> }
> {code}
[jira] [Commented] (SPARK-21638) Warning message of RF is not accurate
[ https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114153#comment-16114153 ] Peng Meng commented on SPARK-21638: --- In the example warning message, the number of nodes to split should be 2621, not 2622.
[jira] [Commented] (SPARK-21638) Warning message of RF is not accurate
[ https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114133#comment-16114133 ] Peng Meng commented on SPARK-21638: --- I will be back home now, will answer your question next week. Thanks [~srowen]
[jira] [Commented] (SPARK-21638) Warning message of RF is not accurate
[ https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114132#comment-16114132 ] Peng Meng commented on SPARK-21638: --- This is because "we don't add the node to mutableNodesForGroup, but we add memUsage" in the current implementation. This logic is not accurate. > Warning message of RF is not accurate > - > > Key: SPARK-21638 > URL: https://issues.apache.org/jira/browse/SPARK-21638 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 > Environment: >Reporter: Peng Meng >Priority: Minor > > When training an RF model, there are many warning messages like this: > {quote}WARN RandomForest: Tree learning is using approximately 268492800 > bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. > This allows splitting 2622 nodes in this iteration.{quote} > This warning message is unnecessary and the data is not accurate. > This is because > {code:java} > while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) { > val (treeIndex, node) = nodeStack.top > // Choose subset of features for node (if subsampling). > val featureSubset: Option[Array[Int]] = if > (metadata.subsamplingFeatures) { > Some(SamplingUtils.reservoirSampleAndCount(Range(0, > metadata.numFeatures).iterator, metadata.numFeaturesPerNode, > rng.nextLong())._1) > } else { > None > } > // Check if enough memory remains to add this node to the group. 
> val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, > featureSubset) * 8L > if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { > nodeStack.pop() > mutableNodesForGroup.getOrElseUpdate(treeIndex, new > mutable.ArrayBuffer[LearningNode]()) += > node > mutableTreeToNodeToIndexInfo > .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, > NodeIndexInfo]())(node.id) > = new NodeIndexInfo(numNodesInGroup, featureSubset) > } > numNodesInGroup += 1 // we don't add the node to mutableNodesForGroup, > but we add memUsage here. > memUsage += nodeMemUsage > } > if (memUsage > maxMemoryUsage) { > // If maxMemoryUsage is 0, we should still allow splitting 1 node. > logWarning(s"Tree learning is using approximately $memUsage bytes per > iteration, which" + > s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This > allows splitting" + > s" $numNodesInGroup nodes in this iteration.") > } > {code} > To avoid this unnecessary warning, we should change the code like this: > {code:java} > while (nodeStack.nonEmpty) { > val (treeIndex, node) = nodeStack.top > // Choose subset of features for node (if subsampling). > val featureSubset: Option[Array[Int]] = if > (metadata.subsamplingFeatures) { > Some(SamplingUtils.reservoirSampleAndCount(Range(0, > metadata.numFeatures).iterator, metadata.numFeaturesPerNode, > rng.nextLong())._1) > } else { > None > } > // Check if enough memory remains to add this node to the group. 
> val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, > featureSubset) * 8L > if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { > nodeStack.pop() > mutableNodesForGroup.getOrElseUpdate(treeIndex, new > mutable.ArrayBuffer[LearningNode]()) += > node > mutableTreeToNodeToIndexInfo > .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, > NodeIndexInfo]())(node.id) > = new NodeIndexInfo(numNodesInGroup, featureSubset) > numNodesInGroup += 1 > memUsage += nodeMemUsage > } else { > break > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
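The accounting slip described above is easy to reproduce outside Spark. Below is an illustrative Python sketch (the function names and node sizes are hypothetical, not Spark code): the current loop bumps the node count and memUsage for the first rejected node before the while condition exits, so the warning reports totals larger than what was actually scheduled, while the proposed early exit keeps the counters consistent with the group.

```python
def plan_group_original(node_sizes, max_mem):
    """Mimics the current loop: counters are bumped even when the node
    does not fit and is therefore never added to the group."""
    group, num_nodes, mem_usage = [], 0, 0
    i = 0
    while i < len(node_sizes) and (mem_usage < max_mem or mem_usage == 0):
        size = node_sizes[i]
        if mem_usage + size <= max_mem or mem_usage == 0:
            group.append(size)  # node is popped and added to the group
            i += 1
        num_nodes += 1          # counted even when the node was rejected
        mem_usage += size       # memory counted even when the node was rejected
    return group, num_nodes, mem_usage

def plan_group_fixed(node_sizes, max_mem):
    """Mimics the proposed fix: stop before counting a node that does not fit."""
    group, num_nodes, mem_usage = [], 0, 0
    for size in node_sizes:
        if not (mem_usage + size <= max_mem or mem_usage == 0):
            break
        group.append(size)
        num_nodes += 1
        mem_usage += size
    return group, num_nodes, mem_usage

# Three nodes of 100 "bytes" each with a 250-byte budget: only two fit.
print(plan_group_original([100, 100, 100], 250))  # ([100, 100], 3, 300)
print(plan_group_fixed([100, 100, 100], 250))     # ([100, 100], 2, 200)
```

With the original accounting, the warning would then claim 300 bytes and 3 nodes even though only 2 nodes (200 bytes) were actually grouped for splitting.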
[jira] [Updated] (SPARK-21638) Warning message of RF is not accurate
[ https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-21638: -- Description: When train RF model, there is many warning message like this: {quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.{quote} This warning message is unnecessary and the data is not accurate. This is because {code:java} while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) { val (treeIndex, node) = nodeStack.top // Choose subset of features for node (if subsampling). val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) { Some(SamplingUtils.reservoirSampleAndCount(Range(0, metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1) } else { None } // Check if enough memory remains to add this node to the group. val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { nodeStack.pop() mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node mutableTreeToNodeToIndexInfo .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id) = new NodeIndexInfo(numNodesInGroup, featureSubset) } numNodesInGroup += 1 *//we not add the node to mutableNodesForGroup, but we add memUsage here.* memUsage += nodeMemUsage } if (memUsage > maxMemoryUsage) { // If maxMemoryUsage is 0, we should still allow splitting 1 node. logWarning(s"Tree learning is using approximately $memUsage bytes per iteration, which" + s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. 
This allows splitting" + s" $numNodesInGroup nodes in this iteration.") } {code} To avoid this unnecessary warning, we should change the code like this: {code:java} while (nodeStack.nonEmpty) { val (treeIndex, node) = nodeStack.top // Choose subset of features for node (if subsampling). val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) { Some(SamplingUtils.reservoirSampleAndCount(Range(0, metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1) } else { None } // Check if enough memory remains to add this node to the group. val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { nodeStack.pop() mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node mutableTreeToNodeToIndexInfo .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id) = new NodeIndexInfo(numNodesInGroup, featureSubset) numNodesInGroup += 1 memUsage += nodeMemUsage } else { break } } {code} was: When train RF model, there is many warning message like this: {quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.{quote} This warning message is unnecessary and the data is not accurate. This is because {code:java} while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) { val (treeIndex, node) = nodeStack.top // Choose subset of features for node (if subsampling). val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) { Some(SamplingUtils.reservoirSampleAndCount(Range(0, metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1) } else { None } // Check if enough memory remains to add this node to the group. 
val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { nodeStack.pop() mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node mutableTreeToNodeToIndexInfo .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id) = new NodeIndexInfo(numNodesInGroup, featureSubset) } numNodesInGroup += 1 *//we not add the node to mutableNodesForGroup, but we add memUsage here.* memUsage += nodeMemUsage } if (memUsage > maxMemoryUsage) { // If maxMemoryUsage is 0, we should still allow splitting 1 node. logWarning(s"Tree learning is using ap
[jira] [Updated] (SPARK-21638) Warning message of RF is not accurate
[ https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-21638: -- Description: When train RF model, there is many warning message like this: {quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.{quote} This warning message is unnecessary and the data is not accurate. This is because {code:java} while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) { val (treeIndex, node) = nodeStack.top // Choose subset of features for node (if subsampling). val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) { Some(SamplingUtils.reservoirSampleAndCount(Range(0, metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1) } else { None } // Check if enough memory remains to add this node to the group. val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { nodeStack.pop() mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node mutableTreeToNodeToIndexInfo .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id) = new NodeIndexInfo(numNodesInGroup, featureSubset) } numNodesInGroup += 1 *//we not add the node to mutableNodesForGroup, but we add memUsage here.* memUsage += nodeMemUsage } if (memUsage > maxMemoryUsage) { // If maxMemoryUsage is 0, we should still allow splitting 1 node. logWarning(s"Tree learning is using approximately $memUsage bytes per iteration, which" + s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. 
This allows splitting" + s" $numNodesInGroup nodes in this iteration.") } {code} To avoid this unnecessary warning, we should change the code like this: {code:java} while (nodeStack.nonEmpty) { val (treeIndex, node) = nodeStack.top // Choose subset of features for node (if subsampling). val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) { Some(SamplingUtils.reservoirSampleAndCount(Range(0, metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1) } else { None } // Check if enough memory remains to add this node to the group. val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { nodeStack.pop() mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node mutableTreeToNodeToIndexInfo .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id) = new NodeIndexInfo(numNodesInGroup, featureSubset) numNodesInGroup += 1 //we not add the node to mutableNodesForGroup, but we add memUsage here. memUsage += nodeMemUsage } else { break } } {code} was: When train RF model, there is many warning message like this: {quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.{quote} This warning message is unnecessary and the data is not accuracy. This is because {code:java} while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) { val (treeIndex, node) = nodeStack.top // Choose subset of features for node (if subsampling). val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) { Some(SamplingUtils.reservoirSampleAndCount(Range(0, metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1) } else { None } // Check if enough memory remains to add this node to the group. 
val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { nodeStack.pop() mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node mutableTreeToNodeToIndexInfo .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id) = new NodeIndexInfo(numNodesInGroup, featureSubset) } numNodesInGroup += 1 *//we not add the node to mutableNodesForGroup, but we add memUsage here.* memUsage += nodeMemUsage } if (memUsage > maxMemoryUsage) { // If maxMemoryUsage is 0, we should
[jira] [Created] (SPARK-21638) Warning message of RF is not accurate
Peng Meng created SPARK-21638: - Summary: Warning message of RF is not accurate Key: SPARK-21638 URL: https://issues.apache.org/jira/browse/SPARK-21638 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.0 Environment: Reporter: Peng Meng Priority: Minor When training an RF model, there are many warning messages like this: {quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.{quote} This warning message is unnecessary and the data is not accurate. This is because {code:java} while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) { val (treeIndex, node) = nodeStack.top // Choose subset of features for node (if subsampling). val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) { Some(SamplingUtils.reservoirSampleAndCount(Range(0, metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1) } else { None } // Check if enough memory remains to add this node to the group. val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { nodeStack.pop() mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node mutableTreeToNodeToIndexInfo .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id) = new NodeIndexInfo(numNodesInGroup, featureSubset) } numNodesInGroup += 1 // we don't add the node to mutableNodesForGroup, but we add memUsage here. memUsage += nodeMemUsage } if (memUsage > maxMemoryUsage) { // If maxMemoryUsage is 0, we should still allow splitting 1 node. logWarning(s"Tree learning is using approximately $memUsage bytes per iteration, which" + s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. 
This allows splitting" + s" $numNodesInGroup nodes in this iteration.") } {code} To avoid this unnecessary warning, we should change the code like this: {code:java} while (nodeStack.nonEmpty) { val (treeIndex, node) = nodeStack.top // Choose subset of features for node (if subsampling). val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) { Some(SamplingUtils.reservoirSampleAndCount(Range(0, metadata.numFeatures).iterator, metadata.numFeaturesPerNode, rng.nextLong())._1) } else { None } // Check if enough memory remains to add this node to the group. val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { nodeStack.pop() mutableNodesForGroup.getOrElseUpdate(treeIndex, new mutable.ArrayBuffer[LearningNode]()) += node mutableTreeToNodeToIndexInfo .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id) = new NodeIndexInfo(numNodesInGroup, featureSubset) numNodesInGroup += 1 memUsage += nodeMemUsage } else { break } } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21624) Optimize communication cost of RF/GBT/DT
[ https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113793#comment-16113793 ] Peng Meng commented on SPARK-21624: --- Thanks [~mlnick], using a Vector and compressing it is reasonable. I will submit a PR and show the performance data. Thanks. > Optimize communication cost of RF/GBT/DT > > > Key: SPARK-21624 > URL: https://issues.apache.org/jira/browse/SPARK-21624 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > > {quote}The implementation of RF is bound by either the cost of statistics > computation on workers or by communicating the sufficient statistics.{quote} > The statistics are stored in allStats: > {code:java} > /** >* Flat array of elements. >* Index for start of stats for a (feature, bin) is: >* index = featureOffsets(featureIndex) + binIndex * statsSize >*/ > private var allStats: Array[Double] = new Array[Double](allStatsSize) > {code} > The size of allStats may be very large, and it can be very sparse, especially > on the nodes near the leaves of the tree. > I have changed allStats from Array to SparseVector; my tests show the > communication is down by about 50%. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
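The saving from a sparse representation can be sketched in a few lines of Python (a stand-in for the Scala SparseVector, not the actual MLlib type): near the leaves, most (feature, bin) statistics are zero, so storing only the non-zero entries shrinks what has to be shuffled between workers.

```python
def to_sparse(stats):
    """Keep only non-zero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(stats) if v != 0.0]

# A mostly-zero stats array, as seen near the leaves of a tree.
all_stats = [0.0] * 1000
all_stats[3] = 2.5
all_stats[997] = 1.0
sparse = to_sparse(all_stats)

dense_bytes = 8 * len(all_stats)      # 8 bytes per Double
sparse_bytes = (4 + 8) * len(sparse)  # rough estimate: 4-byte index + 8-byte value
print(len(sparse), dense_bytes, sparse_bytes)  # 2 8000 24
```

The actual ratio depends on how sparse the stats are at each node, which is why compressing (choosing dense or sparse per vector) is the safer default.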
[jira] [Commented] (SPARK-21624) Optimize communication cost of RF/GBT/DT
[ https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112374#comment-16112374 ] Peng Meng commented on SPARK-21624: --- ping [~josephkb] [~srowen] [~yanboliang] [~mlnick] [~yuhaoyan] > Optimize communication cost of RF/GBT/DT > > > Key: SPARK-21624 > URL: https://issues.apache.org/jira/browse/SPARK-21624 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > > {quote}The implementation of RF is bound by either the cost of statistics > computation on workers or by communicating the sufficient statistics.{quote} > The statistics are stored in allStats: > {code:java} > /** >* Flat array of elements. >* Index for start of stats for a (feature, bin) is: >* index = featureOffsets(featureIndex) + binIndex * statsSize >*/ > private var allStats: Array[Double] = new Array[Double](allStatsSize) > {code} > The size of allStats maybe very large, and it can be very spare, especially > on the nodes that near the leave of the tree. > I have changed allStats from Array to SparseVector, my tests show the > communication is down by about 50%. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21624) Optimize communication cost of RF/GBT/DT
Peng Meng created SPARK-21624: - Summary: Optimize communication cost of RF/GBT/DT Key: SPARK-21624 URL: https://issues.apache.org/jira/browse/SPARK-21624 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.3.0 Reporter: Peng Meng {quote}The implementation of RF is bound by either the cost of statistics computation on workers or by communicating the sufficient statistics.{quote} The statistics are stored in allStats: {code:java} /** * Flat array of elements. * Index for start of stats for a (feature, bin) is: * index = featureOffsets(featureIndex) + binIndex * statsSize */ private var allStats: Array[Double] = new Array[Double](allStatsSize) {code} The size of allStats maybe very large, and it can be very spare, especially on the nodes that near the leave of the tree. I have changed allStats from Array to SparseVector, my tests show the communication is down by about 50%. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21623) Comments of parentStats on ml/tree/impl/DTStatsAggregator.scala is wrong
Peng Meng created SPARK-21623: - Summary: Comments of parentStats on ml/tree/impl/DTStatsAggregator.scala is wrong Key: SPARK-21623 URL: https://issues.apache.org/jira/browse/SPARK-21623 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: Peng Meng Priority: Minor {code:java} * Note: this is necessary because stats for the parent node are not available * on the first iteration of tree learning. */ private val parentStats: Array[Double] = new Array[Double](statsSize) {code} This comment is not right. Actually, parentStats is not only used for the first iteration. It is used with all the iteration for unordered featrues. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
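The reason the comment is wrong can be shown with a small sketch. For an unordered (categorical, one-vs-rest split) feature, only the left-side aggregates are accumulated, and the right side is recovered by subtracting from the parent's stats, so parentStats is consulted on every iteration, not only the first. This Python sketch is illustrative only; the real DTStatsAggregator works on flat Double arrays:

```python
def right_child_stats(parent_stats, left_stats):
    """Recover the right-child sufficient statistics for an unordered
    split as parent - left, instead of aggregating them separately."""
    return [p - l for p, l in zip(parent_stats, left_stats)]

# Parent node saw 10 samples of class 0 and 6 of class 1; the candidate
# left subset received 4 and 1 of them respectively.
print(right_child_stats([10.0, 6.0], [4.0, 1.0]))  # [6.0, 5.0]
```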
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101476#comment-16101476 ] Peng Meng commented on SPARK-21476: --- Hi [~sagraw], could you please test copying the transform method from ProbabilisticClassifier into RandomForestClassificationModel and adding broadcasting inside transform? Thanks. > RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel and the only concrete > definition that RandomForestClassificationModel provides and which is > actually used in transform is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
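The cost under discussion can be sketched without Spark. A closure-captured model is serialized and shipped roughly once per task, while a broadcast variable is shipped roughly once per executor and cached there. The ToyModel class and the task/executor counts below are hypothetical; this is only a back-of-the-envelope illustration of why broadcasting inside transform could reduce task deserialization latency:

```python
import pickle

class ToyModel:
    """Stand-in for a large tree-ensemble model."""
    def __init__(self, n):
        self.weights = list(range(n))

model = ToyModel(100_000)
blob = len(pickle.dumps(model))  # serialized model size in bytes

n_tasks, n_executors = 200, 8
closure_bytes = n_tasks * blob        # model captured in each task closure
broadcast_bytes = n_executors * blob  # model shipped once per executor

print(closure_bytes // broadcast_bytes)  # 25
```

Whether this shows up as wall-clock improvement depends on model size, partition count, and executor count, which is presumably why the reporter is asked for the details of their setup.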
[jira] [Comment Edited] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101476#comment-16101476 ] Peng Meng edited comment on SPARK-21476 at 7/26/17 10:06 AM: - Hi [~sagraw], could you please test copy pasted the transform method from ProbabilisticClassificationModel into RandomForestClassificationModel, and added broadcasting inside transform. Thanks. was (Author: peng.m...@intel.com): Hi [~sagraw], could you please test copy pasted the transform method from ProbabilisticClassifier into RandomForestClassificationModel, and added broadcasting inside transform. Thanks. > RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel and the only concrete > definition that RandomForestClassificationModel provides and which is > actually used in transform is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101253#comment-16101253 ] Peng Meng commented on SPARK-21476: --- Not every transform uses broadcast. Do you have any experimental data showing that broadcast is a problem here? Thanks. > RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel and the only concrete > definition that RandomForestClassificationModel provides and which is > actually used in transform is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099593#comment-16099593 ] Peng Meng edited comment on SPARK-21476 at 7/25/17 6:55 AM: Hi @Suarabh, I am profiling RF transform performance. I change transform to use transformImpl which uses broadcast, but find there is no performance improvement. Could you show me what is your case? For example, the tree size, number of features, dataset partitions, and number of executors. Thanks. was (Author: peng.m...@intel.com): Hi @Suarabh, I am profiling RF transform performance. I change transform to use transformImpl which uses broadcase, but find there is no performance improvement. Could you show me what is your case? For example, the tree size, number of features, dataset partitions, and number of executors. Thanks. > RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel and the only concrete > definition that RandomForestClassificationModel provides and which is > actually used in transform is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099593#comment-16099593 ] Peng Meng commented on SPARK-21476: --- Hi @Suarabh, I am profiling RF transform performance. I change transform to use transformImpl which uses broadcase, but find there is no performance improvement. Could you show me what is your case? For example, the tree size, number of features, dataset partitions, and number of executors. Thanks. > RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel and the only concrete > definition that RandomForestClassificationModel provides and which is > actually used in transform is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094603#comment-16094603 ] Peng Meng commented on SPARK-21476: --- I am optimizing RF and GBT these days. If no one is working on it, I can work on it. Thanks. > RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel and the only concrete > definition that RandomForestClassificationModel provides and which is > actually used in transform is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform
[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094266#comment-16094266 ] Peng Meng commented on SPARK-21476: --- It seems transform should use transformImpl but does not? > RandomForest classification model not using broadcast in transform > -- > > Key: SPARK-21476 > URL: https://issues.apache.org/jira/browse/SPARK-21476 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Saurabh Agrawal > > I notice significant task deserialization latency while running prediction > with pipelines using RandomForestClassificationModel. While digging into the > source, found that the transform method in RandomForestClassificationModel > binds to its parent ProbabilisticClassificationModel and the only concrete > definition that RandomForestClassificationModel provides and which is > actually used in transform is that of predictRaw. Broadcasting is not being > used in predictRaw. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS
[ https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094106#comment-16094106 ] Peng Meng commented on SPARK-2465: -- I think it is time to revisit this now. Some of our customers, such as JD.com, ask us to support Long IDs for ALS. Actually, they have more than Int.MaxValue products. Long IDs for ALS are necessary for them. What do you think about reopening your PR? [~srowen] > Use long as user / item ID for ALS > -- > > Key: SPARK-2465 > URL: https://issues.apache.org/jira/browse/SPARK-2465 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.1 >Reporter: Sean Owen >Priority: Minor > Attachments: ALS using MEMORY_AND_DISK.png, ALS using > MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png > > > I'd like to float this for consideration: use longs instead of ints for user > and product IDs in the ALS implementation. > The main reason is that identifiers are not generally numeric at all, and > will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits > means collisions are likely after hundreds of thousands of users and items, > which is not unrealistic. Hashing to 64 bits pushes this back to billions. > It would also mean numeric IDs that happen to be larger than the largest int > can be used directly as identifiers. > On the downside of course: 8 bytes instead of 4 bytes of memory used per > Rating. > Thoughts? I will post a PR so as to show what the change would be. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
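The collision estimate quoted in the issue can be checked with the standard birthday-bound approximation. This standalone sketch (not Spark code) shows why hashing IDs to 32 bits collides quickly at hundreds of thousands of distinct IDs, while 64 bits pushes collisions out to billions:

```python
import math

def collision_prob(n_ids, hash_bits):
    """Birthday approximation: P(at least one collision) when hashing
    n_ids distinct identifiers uniformly into 2**hash_bits buckets."""
    m = 2.0 ** hash_bits
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2.0 * m))

print(collision_prob(500_000, 32))  # close to 1.0: collisions near-certain
print(collision_prob(500_000, 64))  # on the order of 1e-9: effectively none
```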
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089656#comment-16089656 ] Peng Meng commented on SPARK-21401: --- Got it, thanks [~srowen] > add poll function for BoundedPriorityQueue > -- > > Key: SPARK-21401 > URL: https://issues.apache.org/jira/browse/SPARK-21401 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Minor > > Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern: get the values of the BoundedPriorityQueue, then sort them. > For example, Word2Vec uses pq.toSeq.sortBy(-_._2) and ALS uses pq.toArray.sorted(). > Test results show that using pq.poll() is much faster than sorting the values. > For example, in PR https://github.com/apache/spark/pull/18624 we get the sorted values of pq with the following code: > {quote} > var size = pq.size > while (size > 0) { > size -= 1 > val factor = pq.poll > } {quote} > Using the common pq.toArray.sorted() to get the sorted values of pq instead costs about 10% in performance. > It would be good to add a poll function to BoundedPriorityQueue, since many usages of the PQ need the sorted values.
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089648#comment-16089648 ] Peng Meng commented on SPARK-21401: --- Yes, you don't need to do it for the vast majority of elements. The key problem here is boxing and unboxing. My testing shows that boxing/unboxing consumes most of the time; my micro-benchmark even shows 80% of the time being spent on boxing/unboxing. If a redesigned BoundedPriorityQueue can avoid boxing/unboxing, it will be very useful.
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089638#comment-16089638 ] Peng Meng commented on SPARK-21401: --- I mean we should totally rewrite BoundedPriorityQueue, using neither Java PriorityQueue nor Scala PriorityQueue, to create a high-performance one.
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089629#comment-16089629 ] Peng Meng commented on SPARK-21401: --- Thanks [~srowen]. I mean that for BoundedPriorityQueue, you could also offer directly. There is no need to work like this: {quote} private def maybeReplaceLowest(a: A): Boolean = { val head = underlying.peek() if (head != null && ord.gt(a, head)) { underlying.poll() underlying.offer(a) } else { false } } {quote} poll then offer: poll and offer each need to re-heap, but this doesn't actually require re-heaping twice. Just replacing the root element and re-heaping once is enough. (We would need to rewrite the code of BoundedPriorityQueue to do that.)
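The single-reheap idea described above (replace the root in place instead of poll-then-offer) can be sketched with Python's heapq, whose heappushpop does exactly one sift from the root. This is an illustrative sketch of the technique, not the proposed Scala code; the function name is my own.

```python
import heapq

def top_k(xs, k):
    """Keep the k largest items in a min-heap of size k.

    Instead of poll() followed by offer() (two re-heap operations),
    heapq.heappushpop replaces the root and sifts once -- the
    single-reheap optimization discussed in the comment above.
    """
    heap = []
    for x in xs:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # replace the smallest element and re-heap exactly once
            heapq.heappushpop(heap, x)
    return sorted(heap, reverse=True)
```

The bounded-queue behavior is the same either way; only the number of sift operations per insertion changes.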
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089606#comment-16089606 ] Peng Meng commented on SPARK-21401: --- I think BoundedPriorityQueue should be rewritten. There are two key problems: 1) When inserting an element, the current code polls first and then offers, which re-heaps twice; with a rewrite we would only need one re-heap. 2) BoundedPriorityQueue currently uses Java PriorityQueue, which incurs a lot of boxing/unboxing, and that is very time-consuming. What do you think?
[jira] [Comment Edited] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089586#comment-16089586 ] Peng Meng edited comment on SPARK-21401 at 7/17/17 10:10 AM: - I have tested poll and toArray.sorted extensively. If the queue is mostly ordered (say, offering 2000 times into a queue of size 20), pq.toArray.sorted is faster. If the queue is mostly disordered (say, offering 100 times into a queue of size 20), pq.poll is much faster. So why not keep the poll function for BoundedPriorityQueue? was (Author: peng.m...@intel.com): I have tested poll and toArray.sorted extensively. If the queue is ordered, pq.toArray.sorted is faster. If the queue is much disordered, pq.poll is much faster. So why not keep the poll function for BoundedPriorityQueue?
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089586#comment-16089586 ] Peng Meng commented on SPARK-21401: --- I have tested poll and toArray.sorted extensively. If the queue is mostly ordered, pq.toArray.sorted is faster. If the queue is mostly disordered, pq.poll is much faster. So why not keep the poll function for BoundedPriorityQueue?
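The poll-N-times pattern being compared here amounts to a heapsort: repeatedly removing the root of a heap yields the elements in sorted order, exactly like sorting a copy of the backing array. A minimal Python illustration, with heapq standing in for the JVM priority queue:

```python
import heapq
import random

def drain(heap):
    """The poll()-N-times pattern: pop the root until the heap is empty.

    Each pop costs O(log N), so a full drain is O(N log N) -- the same
    asymptotic cost as sorting, which is why the measured differences
    come down to constant factors and how ordered the data already is.
    """
    out = []
    while heap:
        out.append(heapq.heappop(heap))
    return out

rng = random.Random(0)
data = [rng.random() for _ in range(100)]
heap = list(data)
heapq.heapify(heap)
result = drain(heap)  # ascending order, identical to sorted(data)
```

Both approaches produce identical output, so the choice between them is purely a performance question, as the benchmarks in this thread show.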
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089578#comment-16089578 ] Peng Meng commented on SPARK-21401: --- Hi [~srowen], I found out why my original pq.toArray.sorted test was very slow. My original sorted test used {quote}pq.toArray.sorted(Ordering.By[(Int, Double), Double](-_._2)) {quote} because {quote} pq.toArray.sorted(-_._2) {quote} fails to build. Probably due to boxing/unboxing, its performance was very bad. Now, using {quote}pq.toArray.sortBy(-_._2) {quote}, the performance is better than poll: 25s vs. poll's 26s. Thanks.
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089437#comment-16089437 ] Peng Meng commented on SPARK-21401: --- In my benchmark I only changed between pq.toArray.sorted and pq.poll. The time complexity of pq.poll is always log(N). What is the time complexity of quicksort when the data is already partially ordered?
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089424#comment-16089424 ] Peng Meng commented on SPARK-21401: --- Hi [~srowen], for the ALS optimization, the difference between using poll and sorted is large: when num = 20, prediction takes about 31s with sorted and about 26s with poll. I think this difference is significant. I have tested many times, and the result is about the same. https://github.com/apache/spark/pull/18624. I am testing the effect of changing sorted to poll in LDA.
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089209#comment-16089209 ] Peng Meng commented on SPARK-21401: --- Hi [~srowen], here we also want to get a fully sorted list, by taking the top element N (pq.size) times. In my test, calling pq.poll N times is much faster than pq.toArray.sorted.
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085874#comment-16085874 ] Peng Meng commented on SPARK-21401: --- Sure, I will add isEmpty and maybe some other functions, plus test cases. Thanks.
[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085738#comment-16085738 ] Peng Meng commented on SPARK-21401: --- Yes, SPARK-21389 used pq.poll. pq.poll is just a small part of SPARK-21389, but it improves the performance of SPARK-21389 by about 10% compared with pq.toArray.sorted. Other places that use pq.toArray.sorted may also be able to use pq.poll to improve performance, so I created this JIRA.
[jira] [Updated] (SPARK-21401) add poll function for BoundedPriorityQueue
[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-21401: -- Description: Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern: get the values of the BoundedPriorityQueue, then sort them. For example, Word2Vec uses pq.toSeq.sortBy(-_._2) and ALS uses pq.toArray.sorted(). Test results show that using pq.poll() is much faster than sorting the values. For example, in PR https://github.com/apache/spark/pull/18624 we get the sorted values of pq with the following code: {quote} var size = pq.size while (size > 0) { size -= 1 val factor = pq.poll } {quote} Using the common pq.toArray.sorted() to get the sorted values of pq instead costs about 10% in performance. It would be good to add a poll function to BoundedPriorityQueue, since many usages of the PQ need the sorted values. was: The most of BoundedPriorityQueue usages in ML/MLLIB are: Get the value of BoundedPriorityQueue, then sort it. For example, in Word2Vec: pq.toSeq.sortBy(-_._2) in ALS, pq.toArray.sorted() The test results show using pq.poll() is much faster than sort the value. For example, in PR https://github.com/apache/spark/pull/18624 We get the sorted value of pq by the following code: {quote} var size = pq.size while(size > 0) { size -= 1 val factor = pq.poll }{quote} If using the generally used methods: pq.toArray.sorted() to get the sorted value of pq. There is about 10% performance reduction. It is good to add the poll function for BoundedPriorityQueue, since many usages of PQ need the sorted value.
[jira] [Created] (SPARK-21401) add poll function for BoundedPriorityQueue
Peng Meng created SPARK-21401: - Summary: add poll function for BoundedPriorityQueue Key: SPARK-21401 URL: https://issues.apache.org/jira/browse/SPARK-21401 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.3.0 Reporter: Peng Meng Priority: Minor Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern: get the values of the BoundedPriorityQueue, then sort them. For example, Word2Vec uses pq.toSeq.sortBy(-_._2) and ALS uses pq.toArray.sorted(). Test results show that using pq.poll() is much faster than sorting the values. For example, in PR https://github.com/apache/spark/pull/18624 we get the sorted values of pq with the following code: {quote} var size = pq.size while (size > 0) { size -= 1 val factor = pq.poll } {quote} Using the common pq.toArray.sorted() to get the sorted values of pq instead costs about 10% in performance. It would be good to add a poll function to BoundedPriorityQueue, since many usages of the PQ need the sorted values.
[jira] [Updated] (SPARK-21389) ALS recommendForAll optimization uses Native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-21389: -- Description: In Spark 2.2, we have optimized ALS recommendForAll, which uses a handwriting matrix multiplication, and get the topK items for each matrix. The method effectively reduce the GC problem. However, Native BLAS GEMM, like Intel MKL, and OpenBLAS, the performance of matrix multiplication is about 10X comparing with handwriting method. I have rewritten the code of recommendForAll with GEMM, and got about 50% improvement comparing with the master recommendForAll method. The key point of this optimization: 1), use GEMM to replace hand-written matrix multiplication. 2), Use matrix to keep temp result, largely reduce GC and computing time. The master method create many small objects, which causes using GEMM directly cannot get good performance. 3), Use sort and merge to get the topK items, which don't need to call priority queue two times. Test Result: 479818 users, 13727 products, rank = 10, topK = 20. 3 workers, each with 35 cores. Native BLAS is Intel MKL. Block Size: 1000===2000===4000===8000 Master Method:40s==39.4s===39.5s===39.1s This Method 26.5s==25.9s===26s===27.1s Performance Improvement: (OldTime - NewTime)/NewTime = about 50% was: In Spark 2.2, we have optimized ALS recommendForAll, which uses a handwriting matrix multiplication, and get the topK items for each matrix. The method effectively reduce the GC problem. However, Native BLAS GEMM, like Intel MKL, and OpenBLAS, the performance of matrix multiplication is about 10X comparing with handwriting method. I have rewritten the code of recommendForAll with GEMM, and got about 50% improvement comparing with the master recommendForAll method. The key point of this optimization: 1), use GEMM to replace hand-written matrix multiplication. 2), Use matrix to keep temp result, largely reduce GC and computing time. 
The master method creates many small objects, which prevents using GEMM directly from achieving good performance. 3) Use sort and merge to get the topK items, which avoids calling the priority queue twice. Test Result: 479818 users, 13727 products, rank = 10, topK = 20. 3 workers, each with 35 cores. Native BLAS is Intel MKL.
Block Size:    1000   2000   4000   8000
Master Method: 40s    39.4s  39.5s  39.1s
This Method:   26.5s  25.9s  26s    27.1s
Performance Improvement: (OldTime - NewTime)/NewTime = about 50%
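The three optimization points in the description (block-wise GEMM, a reusable score matrix, and a cheap top-K extraction) can be sketched with NumPy standing in for native BLAS. This is an illustrative sketch only, not the actual Spark implementation; the function name, block size, and the use of argpartition are my own choices.

```python
import numpy as np

def recommend_top_k(user_factors, item_factors, k, block_size=2):
    """Score a block of users against all items with one GEMM call,
    then take the top-k items per user without fully sorting each row."""
    n_users = user_factors.shape[0]
    recs = []
    for start in range(0, n_users, block_size):
        block = user_factors[start:start + block_size]
        scores = block @ item_factors.T          # one BLAS GEMM per block
        # argpartition finds the k largest columns per row in linear time...
        top = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]
        for row, idx in zip(scores, top):
            # ...then only those k entries are sorted by score
            order = idx[np.argsort(-row[idx])]
            recs.append([int(i) for i in order])
    return recs
```

Batching many users into one matrix multiply is what lets the BLAS kernel do the heavy lifting, instead of allocating many small objects per user as the hand-written loop does.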
[jira] [Updated] (SPARK-21389) ALS recommendForAll optimization uses Native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-21389: -- Description: In Spark 2.2, we have optimized ALS recommendForAll, which uses a handwriting matrix multiplication, and get the topK items for each matrix. The method effectively reduce the GC problem. However, Native BLAS GEMM, like Intel MKL, and OpenBLAS, the performance of matrix multiplication is about 10X comparing with handwriting method. I have rewritten the code of recommendForAll with GEMM, and got about 50% improvement comparing with the master recommendForAll method. The key point of this optimization: 1), use GEMM to replace hand-written matrix multiplication. 2), Use matrix to keep temp result, largely reduce GC and computing time. The master method create many small objects, which causes using GEMM directly cannot get good performance. 3), Use sort and merge to get the topK items, which don't need to call priority queue two times. Test Result: 479818 users, 13727 products, rank = 10, topK = 20. 3 workers, each with 35 cores. Native BLAS is Intel MKL. Block Size: 1000===2000===4000===8000 Master Method:40s-39.4s-39.5s39.1s This Method 26.5s---25.9s26s-27.1s Performance Improvement: (OldTime - NewTime)/NewTime = about 50% was: In Spark 2.2, we have optimized ALS recommendForAll, which uses a handwriting matrix multiplication, and get the topK items for each matrix. The method effectively reduce the GC problem. However, Native BLAS GEMM, like Intel MKL, and OpenBLAS, the performance of matrix multiplication is about 10X comparing with handwriting method. I have rewritten the code of recommendForAll with GEMM, and got about 20%-30% improvement comparing with the master recommendForAll method. Will clean the code and submit for discussion. 
> ALS recommendForAll optimization uses Native BLAS > - > > Key: SPARK-21389 > URL: https://issues.apache.org/jira/browse/SPARK-21389 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > Original Estimate: 168h > Remaining Estimate: 168h > > In Spark 2.2, we optimized ALS recommendForAll, which uses a hand-written > matrix multiplication and gets the topK items for each matrix. The method > effectively reduces the GC problem. However, with native BLAS GEMM (like Intel MKL > and OpenBLAS), matrix multiplication is about 10X faster than the hand-written method. > I have rewritten recommendForAll with GEMM and got about 50% > improvement over the master recommendForAll method. > The key points of this optimization: > 1) Use GEMM to replace the hand-written matrix multiplication. > 2) Use a matrix to keep temp results, which largely reduces GC and computing time. The > master method creates many small objects, which is why using GEMM directly > cannot get good performance. > 3) Use sort and merge to get the topK items, which avoids calling the > priority queue twice. > Test Result: > 479818 users, 13727 products, rank = 10, topK = 20. > 3 workers, each with 35 cores. Native BLAS is Intel MKL. > Block Size: 1000 / 2000 / 4000 / 8000 > Master Method: 40s / 39.4s / 39.5s / 39.1s > This Method: 26.5s / 25.9s / 26s / 27.1s > Performance Improvement: (OldTime - NewTime)/NewTime = about 50%
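The sort-and-merge top-K idea in point 3) can be sketched in pure Python (a hypothetical illustration with made-up function names, not the actual Spark/Scala code): sort each block's candidate scores once, keep the block-local top-k, then merge the already-sorted lists, instead of pushing every candidate through a priority queue twice.

```python
import heapq

def topk_sort_merge(score_blocks, k):
    """Take the global top-k (item, score) pairs from several score blocks.

    Each block is a list of (item_id, score). Sorting each block once and
    merging only the per-block top-k candidates avoids feeding every
    candidate through a priority queue a second time.
    """
    per_block = []
    for block in score_blocks:
        # Sort descending by score and keep only the block-local top-k.
        per_block.append(sorted(block, key=lambda t: -t[1])[:k])
    # heapq.merge consumes the already-sorted lists in descending-score order.
    merged = heapq.merge(*per_block, key=lambda t: -t[1])
    return [next(merged) for _ in range(min(k, sum(map(len, per_block))))]

blocks = [[(1, 0.9), (2, 0.1)], [(3, 0.8), (4, 0.95)]]
print(topk_sort_merge(blocks, 2))  # [(4, 0.95), (1, 0.9)]
```

Because each block is sorted once, the merge step only ever inspects the heads of the per-block lists, which is the saving the comment describes.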
[jira] [Created] (SPARK-21389) ALS recommendForAll optimization uses Native BLAS
Peng Meng created SPARK-21389: - Summary: ALS recommendForAll optimization uses Native BLAS Key: SPARK-21389 URL: https://issues.apache.org/jira/browse/SPARK-21389 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.3.0 Reporter: Peng Meng In Spark 2.2, we optimized ALS recommendForAll, which uses a hand-written matrix multiplication and gets the topK items for each matrix. The method effectively reduces the GC problem. However, with native BLAS GEMM (like Intel MKL and OpenBLAS), matrix multiplication is about 10X faster than the hand-written method. I have rewritten recommendForAll with GEMM and got about 20%-30% improvement over the master recommendForAll method. Will clean up the code and submit for discussion.
[jira] [Comment Edited] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance
[ https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076152#comment-16076152 ] Peng Meng edited comment on SPARK-21305 at 7/6/17 8:22 AM: --- I tested Intel MKL and OpenBLAS with ALS train and prediction. ALS train uses BLAS SPR; ALS prediction uses BLAS GEMM. For ALS prediction: OpenBLAS 124 (MT) / 37 (ST); Intel MKL 87 (MT) / 34 (ST). For ALS train: OpenBLAS 70 (MT) / 63 (ST); Intel MKL 58 (MT) / 58 (ST). Some native BLAS subprograms are not multi-threaded, so no matter whether multi-threading is disabled, the performance will not change. Therefore disabling multi-threading will not improve the performance of Spark jobs that use single-threaded native BLAS routines. was (Author: peng.m...@intel.com): I tested Intel MKL and OpenBLAS by ALS Train and Prediction. ALS Train uses BLAS SPR, ALS prediction uses BLAS GEMM. For ALS Prediction: OpenBLAS Intel-MKL MT:124 87 ST: 37 34 For ALS Train: MT 70 58 ST 63 58 Some of Native BLAS subprograms are not multi-threading, so no mater disable MT or not, the performance will not change. So disable multi-thread will not improve performance of spark JOB which use single thread native BLAS. > The BKM (best known methods) of using native BLAS to improvement ML/MLLIB > performance > - > > Key: SPARK-21305 > URL: https://issues.apache.org/jira/browse/SPARK-21305 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Critical > Original Estimate: 504h > Remaining Estimate: 504h > > Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to > improve performance. > How native BLAS is used matters for performance; sometimes (quite often) > native BLAS even causes worse performance. > For example, consider the ALS recommendForAll method before Spark 2.2, which uses > BLAS gemm for matrix multiplication.
> If you only test the matrix multiplication performance of native BLAS gemm > (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, the native > BLAS gives about a 10X performance improvement. But if you test the Spark job > end-to-end performance, F2j is much faster than native BLAS, which is very > interesting. > I spent much time on this problem, and found we should not use a native BLAS > (like OpenBLAS and Intel MKL) that supports multi-threading without any settings. > By default, such a native BLAS will enable multi-threading, which conflicts > with the Spark executor. You can use multi-threaded native BLAS, but it is better > to disable multi-threading first. > https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded > https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications > I think we should add some notes in docs/ml-guide.md about this first.
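The single-thread settings the linked OpenBLAS and MKL pages recommend can be applied through environment variables before launching executors; putting them in spark-env.sh, as discussed later in this thread, is one common choice, not the only one:

```shell
# Pin native BLAS libraries to one thread so they do not oversubscribe
# cores that the Spark executor is already scheduling work onto.
export OPENBLAS_NUM_THREADS=1   # OpenBLAS thread count
export MKL_NUM_THREADS=1        # Intel MKL thread count
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS MKL_NUM_THREADS=$MKL_NUM_THREADS"
```

As the comments above note, this only matters for multi-threaded BLAS builds; single-threaded routines ignore these variables.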
[jira] [Commented] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance
[ https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076152#comment-16076152 ] Peng Meng commented on SPARK-21305: --- I tested Intel MKL and OpenBLAS with ALS train and prediction. ALS train uses BLAS SPR; ALS prediction uses BLAS GEMM. For ALS prediction: OpenBLAS 124 (MT) / 37 (ST); Intel MKL 87 (MT) / 34 (ST). For ALS train: 70 (MT) / 63 (ST); 58 (MT) / 58 (ST). Some native BLAS subprograms are not multi-threaded, so no matter whether multi-threading is disabled, the performance will not change. Therefore disabling multi-threading will not improve the performance of Spark jobs that use single-threaded native BLAS routines. > The BKM (best known methods) of using native BLAS to improvement ML/MLLIB > performance > - > > Key: SPARK-21305 > URL: https://issues.apache.org/jira/browse/SPARK-21305 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Critical > Original Estimate: 504h > Remaining Estimate: 504h > > Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to > improve performance. > How native BLAS is used matters for performance; sometimes (quite often) > native BLAS even causes worse performance. > For example, consider the ALS recommendForAll method before Spark 2.2, which uses > BLAS gemm for matrix multiplication. > If you only test the matrix multiplication performance of native BLAS gemm > (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, the native > BLAS gives about a 10X performance improvement. But if you test the Spark job > end-to-end performance, F2j is much faster than native BLAS, which is very > interesting. > I spent much time on this problem, and found we should not use a native BLAS > (like OpenBLAS and Intel MKL) that supports multi-threading without any settings. > By default, such a native BLAS will enable multi-threading, which conflicts > with the Spark executor.
You can use multi-threaded native BLAS, but it is better > to disable multi-threading first. > https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded > https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications > I think we should add some notes in docs/ml-guide.md about this first.
[jira] [Commented] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance
[ https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074320#comment-16074320 ] Peng Meng commented on SPARK-21305: --- Thanks [~srowen] and [~yanboliang]. I will disable native BLAS multi-threading in spark-env.sh and upload the performance data. A PR will be submitted soon. > The BKM (best known methods) of using native BLAS to improvement ML/MLLIB > performance > - > > Key: SPARK-21305 > URL: https://issues.apache.org/jira/browse/SPARK-21305 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Critical > Original Estimate: 504h > Remaining Estimate: 504h > > Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to > improve performance. > How native BLAS is used matters for performance; sometimes (quite often) > native BLAS even causes worse performance. > For example, consider the ALS recommendForAll method before Spark 2.2, which uses > BLAS gemm for matrix multiplication. > If you only test the matrix multiplication performance of native BLAS gemm > (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, the native > BLAS gives about a 10X performance improvement. But if you test the Spark job > end-to-end performance, F2j is much faster than native BLAS, which is very > interesting. > I spent much time on this problem, and found we should not use a native BLAS > (like OpenBLAS and Intel MKL) that supports multi-threading without any settings. > By default, such a native BLAS will enable multi-threading, which conflicts > with the Spark executor. You can use multi-threaded native BLAS, but it is better > to disable multi-threading first. > https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded > https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications > I think we should add some notes in docs/ml-guide.md about this first.
[jira] [Comment Edited] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance
[ https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073801#comment-16073801 ] Peng Meng edited comment on SPARK-21305 at 7/4/17 3:39 PM: --- Yes, I will do that. Because the method to disable multi-threading differs between BLAS implementations, I have not found a good way to disable it in Spark code. was (Author: peng.m...@intel.com): yes, I will do that. Because different blas, the method to disable multi-thread is also different. I have not found a good way be disable it in spark code. > The BKM (best known methods) of using native BLAS to improvement ML/MLLIB > performance > - > > Key: SPARK-21305 > URL: https://issues.apache.org/jira/browse/SPARK-21305 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Critical > Original Estimate: 504h > Remaining Estimate: 504h > > Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to > improve performance. > How native BLAS is used matters for performance; sometimes (quite often) > native BLAS even causes worse performance. > For example, consider the ALS recommendForAll method before Spark 2.2, which uses > BLAS gemm for matrix multiplication. > If you only test the matrix multiplication performance of native BLAS gemm > (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, the native > BLAS gives about a 10X performance improvement. But if you test the Spark job > end-to-end performance, F2j is much faster than native BLAS, which is very > interesting. > I spent much time on this problem, and found we should not use a native BLAS > (like OpenBLAS and Intel MKL) that supports multi-threading without any settings. > By default, such a native BLAS will enable multi-threading, which conflicts > with the Spark executor. You can use multi-threaded native BLAS, but it is better > to disable multi-threading first.
> https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded > https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications > I think we should add some notes in docs/ml-guide.md about this first.
[jira] [Commented] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance
[ https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073801#comment-16073801 ] Peng Meng commented on SPARK-21305: --- Yes, I will do that. Because the method to disable multi-threading differs between BLAS implementations, I have not found a good way to disable it in Spark code. > The BKM (best known methods) of using native BLAS to improvement ML/MLLIB > performance > - > > Key: SPARK-21305 > URL: https://issues.apache.org/jira/browse/SPARK-21305 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Critical > Original Estimate: 504h > Remaining Estimate: 504h > > Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to > improve performance. > How native BLAS is used matters for performance; sometimes (quite often) > native BLAS even causes worse performance. > For example, consider the ALS recommendForAll method before Spark 2.2, which uses > BLAS gemm for matrix multiplication. > If you only test the matrix multiplication performance of native BLAS gemm > (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, the native > BLAS gives about a 10X performance improvement. But if you test the Spark job > end-to-end performance, F2j is much faster than native BLAS, which is very > interesting. > I spent much time on this problem, and found we should not use a native BLAS > (like OpenBLAS and Intel MKL) that supports multi-threading without any settings. > By default, such a native BLAS will enable multi-threading, which conflicts > with the Spark executor. You can use multi-threaded native BLAS, but it is better > to disable multi-threading first.
> https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded > https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications > I think we should add some notes in docs/ml-guide.md about this first.
[jira] [Commented] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance
[ https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073798#comment-16073798 ] Peng Meng commented on SPARK-21305: --- ping [~mlnick], [~yanboliang], [~mengxr], [~srowen] > The BKM (best known methods) of using native BLAS to improvement ML/MLLIB > performance > - > > Key: SPARK-21305 > URL: https://issues.apache.org/jira/browse/SPARK-21305 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Critical > Original Estimate: 504h > Remaining Estimate: 504h > > Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to > improve performance. > How native BLAS is used matters for performance; sometimes (quite often) > native BLAS even causes worse performance. > For example, consider the ALS recommendForAll method before Spark 2.2, which uses > BLAS gemm for matrix multiplication. > If you only test the matrix multiplication performance of native BLAS gemm > (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, the native > BLAS gives about a 10X performance improvement. But if you test the Spark job > end-to-end performance, F2j is much faster than native BLAS, which is very > interesting. > I spent much time on this problem, and found we should not use a native BLAS > (like OpenBLAS and Intel MKL) that supports multi-threading without any settings. > By default, such a native BLAS will enable multi-threading, which conflicts > with the Spark executor. You can use multi-threaded native BLAS, but it is better > to disable multi-threading first. > https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded > https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications > I think we should add some notes in docs/ml-guide.md about this first.
[jira] [Created] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance
Peng Meng created SPARK-21305: - Summary: The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance Key: SPARK-21305 URL: https://issues.apache.org/jira/browse/SPARK-21305 Project: Spark Issue Type: Umbrella Components: ML, MLlib Affects Versions: 2.3.0 Reporter: Peng Meng Priority: Critical Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to improve performance. How native BLAS is used matters for performance; sometimes (quite often) native BLAS even causes worse performance. For example, consider the ALS recommendForAll method before Spark 2.2, which uses BLAS gemm for matrix multiplication. If you only test the matrix multiplication performance of native BLAS gemm (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, the native BLAS gives about a 10X performance improvement. But if you test the Spark job end-to-end performance, F2j is much faster than native BLAS, which is very interesting. I spent much time on this problem, and found we should not use a native BLAS (like OpenBLAS and Intel MKL) that supports multi-threading without any settings. By default, such a native BLAS will enable multi-threading, which conflicts with the Spark executor. You can use multi-threaded native BLAS, but it is better to disable multi-threading first. https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications I think we should add some notes in docs/ml-guide.md about this first.
[jira] [Updated] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-20443: -- Attachment: blockSize.jpg blockSize and the performance of ALS recommendForAll > The blockSize of MLLIB ALS should be setting by the User > - > > Key: SPARK-20443 > URL: https://issues.apache.org/jira/browse/SPARK-20443 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Minor > Attachments: blockSize.jpg > > > The blockSize of MLLIB ALS is very important for ALS performance. > In our test, with blockSize 128 the performance is about 4X better than with > blockSize 4096 (the default value). > The following are our test results: > BlockSize (recommendForAll time): > 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s), 8192 (OOM) > The Test Environment: > 3 workers: each worker has 10 cores, 30G memory, and 1 executor. > The Data: 480,000 users and 17,000 items -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
[ https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020662#comment-16020662 ] Peng Meng commented on SPARK-20764: --- I will submit a PR covering more tests for the model summary, thanks. > Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and > GLR - Python version > > > Key: SPARK-20764 > URL: https://issues.apache.org/jira/browse/SPARK-20764 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: Peng Meng >Priority: Minor > Fix For: 2.2.0 > > > SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and > {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should > be updated to reflect these changes.
[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
[ https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020555#comment-16020555 ] Peng Meng commented on SPARK-20764: --- Thanks [~yanboliang], but isn't that check too limited? The PySpark LinearRegressionSummary test has only two lines of code and checks just one function. If there were a typo, say I wrote [return self._call_java("degreeOfFreedom")] ("degreeOfFreedom" in place of "degreesOfFreedom"), no error would be raised and all tests would pass, but the code certainly wouldn't work. > Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and > GLR - Python version > > > Key: SPARK-20764 > URL: https://issues.apache.org/jira/browse/SPARK-20764 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: Peng Meng >Priority: Minor > Fix For: 2.2.0 > > > SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and > {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should > be updated to reflect these changes.
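The failure mode described in that comment can be caught by a test that actually invokes the property against a recording stub. The sketch below uses hypothetical class names; the real PySpark summary delegates to a JVM object via _call_java, which is simplified here to a getattr lookup.

```python
class FakeJavaModel:
    """Stand-in for the JVM summary object; knows only the real method name."""
    def degreesOfFreedom(self):
        return 7

class Summary:
    """Minimal sketch of a PySpark-style wrapper around a Java model."""
    def __init__(self, java_obj):
        self._java_obj = java_obj

    def _call_java(self, name):
        # A typo'd name raises AttributeError here instead of silently passing.
        return getattr(self._java_obj, name)()

    @property
    def degreesOfFreedom(self):
        return self._call_java("degreesOfFreedom")

s = Summary(FakeJavaModel())
assert s.degreesOfFreedom == 7  # "degreeOfFreedom" would raise AttributeError
```

A test that only checks the property exists on the Python side would never exercise the string passed to _call_java, which is exactly the gap the comment points out.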
[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
[ https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019520#comment-16019520 ] Peng Meng commented on SPARK-20764: --- Hi [~mlnick], why is there no test code for some classes in PySpark, such as LinearRegressionSummary and GeneralizedLinearRegressionSummary? > Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and > GLR - Python version > > > Key: SPARK-20764 > URL: https://issues.apache.org/jira/browse/SPARK-20764 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and > {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should > be updated to reflect these changes.
[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
[ https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019096#comment-16019096 ] Peng Meng commented on SPARK-20764: --- Hi [~mlnick], are you working on this? If not, I can work on it. > Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and > GLR - Python version > > > Key: SPARK-20764 > URL: https://issues.apache.org/jira/browse/SPARK-20764 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and > {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should > be updated to reflect these changes.
[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983074#comment-15983074 ] Peng Meng commented on SPARK-11968: --- Thanks [~mlnick], I will post more results here. My latest result: I changed the PriorityQueue to a BoundedPriorityQueue, which gives about 30% improvement. I will update the PR and the results. > ALS recommend all methods spend most of time in GC > -- > > Key: SPARK-11968 > URL: https://issues.apache.org/jira/browse/SPARK-11968 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.2, 1.6.0 >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath > > After adding recommendUsersForProducts and recommendProductsForUsers to ALS > in spark-perf, I noticed that it takes much longer than ALS itself. Looking > at the monitoring page, I can see it is spending about 8min doing GC for each > 10min task. That sounds fixable. Looking at the implementation, there is > clearly an opportunity to avoid extra allocations: > [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283] > CC: [~mengxr]
[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983059#comment-15983059 ] Peng Meng commented on SPARK-20443: --- Yes, based on my current tests, I agree. But if the data size is large, there may be a benefit to adjusting the block size. > The blockSize of MLLIB ALS should be setting by the User > - > > Key: SPARK-20443 > URL: https://issues.apache.org/jira/browse/SPARK-20443 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng >Priority: Minor > > The blockSize of MLLIB ALS is very important for ALS performance. > In our test, with blockSize 128 the performance is about 4X better than with > blockSize 4096 (the default value). > The following are our test results: > BlockSize (recommendForAll time): > 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s), 8192 (OOM) > The Test Environment: > 3 workers: each worker has 10 cores, 30G memory, and 1 executor. > The Data: 480,000 users and 17,000 items
[jira] [Comment Edited] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982251#comment-15982251 ] Peng Meng edited comment on SPARK-20446 at 4/25/17 3:06 PM: Yes, I compared with ML ALSModel.recommendAll. The data size is 480,000 x 17,000; comparing with mllib recommendAll, there is about 10% to 20% improvement. Blockify is very important for the performance of our solution, because the compute of CartesianRDD is: for (x <- rdd1.iterator(currSplit.s1, context); y <- rdd2.iterator(currSplit.s2, context)) yield (x, y) Without blockify, rdd2.iterator is called once per record of rdd1; with blockify, it is called once per group of rdd1. The performance difference is very large. The SparkSQL approach is much different from my solution. It uses crossJoin to do the Cartesian product; the key optimization is in UnsafeCartesianRDD, which uses rowArray to cache the right partition. The key parts of our method: 1. Use blockify to optimize the Cartesian product computation, not to use blockify for BLAS 3. 2. Use a priorityQueue to save memory (one used per group), not to find the topK products for each user. The SparkSQL approach doesn't work like this. cc [~mengxr] was (Author: peng.m...@intel.com): Yes, I compared with ML ALSModel.recommendAll. The data size is 480,000*17,000, comparing with mllib recommendAll, there is about 10% to 20% improvement. Blockify is very important the performance of our solution. Because the compute of CartesianRDD is: for (x <- rdd1.iterator(currSplit.s1, context); y <- rdd2.iterator(currSplit.s2, context)) yield (x, y) If no blockify, rdd2.iterator will be called the numberOfRecords times of rdd1. With blockify, rdd2.iterator will be called the numberOfGroups times of rdd1. The performance difference is very large. SparkSQL approach is much different from my solution. It uses crossJoin to do Cartesian product, the key optimization is in UnsafeCartesianRDD, which uses rowArray to cache the right partition. The key part of our method: 1. use blockify to optimize the Cartesian production computation, not use blockify for BLAS 3. 2. use priorityQueue to save memory (used on each group), not to find TopK product for each user. SqarkSQL approach doesn't work like this. cc [~mengxr] > Optimize the process of MLLIB ALS recommendForAll > - > > Key: SPARK-20446 > URL: https://issues.apache.org/jira/browse/SPARK-20446 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > > The recommendForAll of MLLIB ALS is very slow. > GC is a key problem of the current method. > The task uses the following code to keep temp results: > val output = new Array[(Int, (Int, Double))](m*n) > m = n = 4096 (default value, with no way to change it) > so output is about 4k * 4k * (4 + 4 + 8) bytes = 256M. This is a lot of memory, > causes serious GC problems, and frequently OOMs. > Actually, we don't need to save all the temp results. Suppose we recommend topK > (topK is about 10 or 20) products for each user; we only need 4k * topK * (4 + 4 + 8) > bytes to save the temp results. > I have written a solution for this method with the following test results. > The Test Environment: > 3 workers: each worker has 10 cores, 30G memory, and 1 executor. > The Data: 480,000 users and 17,000 items > BlockSize: 1024 / 2048 / 4096 / 8192 > Old method: 245s / 332s / 488s / OOM > This solution: 121s / 118s / 117s / 120s >
[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983030#comment-15983030 ] Peng Meng commented on SPARK-20446: --- Thanks [~mlnick], I agree with you. I am ok to close this ticket and move the discussion to SPARK-11968.
[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982251#comment-15982251 ] Peng Meng commented on SPARK-20446: --- Yes, I compared with ML ALSModel.recommendAll. The data size is 480,000*17,000; comparing with mllib recommendAll, there is about a 10% to 20% improvement. Blockify is very important for the performance of our solution, because the compute of CartesianRDD is: for (x <- rdd1.iterator(currSplit.s1, context); y <- rdd2.iterator(currSplit.s2, context)) yield (x, y) Without blockify, rdd2.iterator will be called once per record of rdd1. With blockify, it will be called once per group of rdd1. The performance difference is very large. The SparkSQL approach is much different from my solution: it uses crossJoin to do the Cartesian product, and its key optimization is in UnsafeCartesianRDD, which uses rowArray to cache the right partition. The key parts of our method: 1. use blockify to optimize the Cartesian product computation, not use blockify for BLAS 3. 2. use a priorityQueue to save memory (applied on each group), not to find the topK products for each user. The SparkSQL approach doesn't work like this. cc [~mengxr]
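The memory figures in the ticket description check out; a quick sketch of the arithmetic, taking 4 + 4 + 8 = 16 bytes per (Int, (Int, Double)) record as the description does and ignoring JVM object overhead:

```python
RECORD_BYTES = 4 + 4 + 8                   # Int + Int + Double

old_buffer = 4096 * 4096 * RECORD_BYTES    # full m*n block result
new_buffer = 4096 * 10 * RECORD_BYTES      # only topK (here 10) rows per block

assert old_buffer == 256 * 1024 * 1024     # exactly 256 MB per task
assert new_buffer == 640 * 1024            # 640 KB: roughly 400x smaller
```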
[jira] [Comment Edited] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981148#comment-15981148 ] Peng Meng edited comment on SPARK-20446 at 4/24/17 4:18 PM: Thanks [~mlnick], I also compared the DataFrame version of ALS recommendForAll, but found no big performance improvement. In our solution: 1. We use BLAS 2 to replace BLAS 3, which removes much of the GC caused by the BLAS computation. 2. We use output = Array[(Int, (Int, Double))](4096*topK) to replace output = Array[(Int, (Int, Double))](4096*4096), which largely reduces the memory allocation. 3. We use a priorityQueue for the sort, which improves it by about 40% compared with a general sort. In our experiments with different configurations (different numbers of machines, different numbers of cores per machine), this solution is about a 5x-9x improvement over the current method when the blockSize is 4096. There is no OOM for this solution, and its performance is about the same across blockSizes; for the old method, performance is highly dependent on blockSize. cc [~mengxr]
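A minimal sketch of the priority-queue idea in point 3 (plain Python heapq, not the Scala code from the PR): a bounded min-heap of size k keeps only the topK candidates in memory, and its result matches a full descending sort of the scores.

```python
import heapq
import random

def top_k_heap(scores, k):
    """Keep a bounded min-heap of size k instead of sorting everything."""
    heap = []  # min-heap: smallest retained score sits at heap[0]
    for item, score in scores:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            # new score beats the worst retained one: swap it in
            heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)  # best first

random.seed(0)
scores = [(i, random.random()) for i in range(10000)]

by_heap = top_k_heap(scores, 10)
by_sort = sorted(((s, i) for i, s in scores), reverse=True)[:10]
assert by_heap == by_sort  # same topK, but O(k) memory instead of O(n)
```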
[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981148#comment-15981148 ] Peng Meng commented on SPARK-20446: --- Thanks [~mlnick], I also compared the DataFrame version of ALS recommendForAll, but found no big performance improvement. In our solution: 1. We use BLAS 2 to replace BLAS 3, which removes much of the GC caused by the BLAS computation. 2. We use output = Array[(Int, (Int, Double))](4096*topK) to replace output = Array[(Int, (Int, Double))](4096*4096), which largely reduces the memory allocation. 3. We use a priorityQueue for the sort, which improves it by about 40% compared with a general sort. cc [~mengxr]
[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981115#comment-15981115 ] Peng Meng commented on SPARK-20446: --- I think you mean: https://github.com/apache/spark/pull/9980 Maybe the problem is the same, but the solution is totally different. PR 9980 still uses BLAS 3 in its solution; GC is caused by BLAS 3, so the 9980 method cannot solve the GC problem. Very glad to know many people have studied this problem.
[jira] [Updated] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-20446: -- Description: The recommendForAll of MLLIB ALS is very slow. GC is a key problem of the current method. The task uses the following code to keep the temp result: val output = new Array[(Int, (Int, Double))](m*n) m = n = 4096 (default value, no method to set it) so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a large amount of memory and causes serious GC problems; it frequently OOMs. Actually, we don't need to save all the temp results. Suppose we recommend topK (topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 4 + 8) memory to save the temp result. I have written a solution for this method with the following test results. The Test Environment: 3 workers: each worker 10 cores, 30G memory, 1 executor. The Data: User 480,000, and Item 17,000 BlockSize: 1024 2048 4096 8192 Old method: 245s 332s 488s OOM This solution: 121s 118s 117s 120s
[jira] [Created] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll
Peng Meng created SPARK-20446: - Summary: Optimize the process of MLLIB ALS recommendForAll Key: SPARK-20446 URL: https://issues.apache.org/jira/browse/SPARK-20446 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.3.0 Reporter: Peng Meng The recommendForAll of MLLIB ALS is very slow. GC is a key problem of the current method. The task uses the following code to keep the temp result: val output = new Array[(Int, (Int, Double))](m*n) m = n = 4096 (default value, no method to set it) so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a large amount of memory and causes serious GC problems; it frequently OOMs. Actually, we don't need to save all the temp results. Suppose we recommend topK (topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 4 + 8) memory to save the temp result.
[jira] [Updated] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-20443: -- Description: The blockSize of MLLIB ALS is very important for ALS performance. In our test, when the blockSize is 128, the performance is about 4X compared with a blockSize of 4096 (the default value). The following are our test results: BlockSize(recommendationForAll time) 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM) The Test Environment: 3 workers: each worker 10 cores, 30G memory, 1 executor. The Data: User 480,000, and Item 17,000
[jira] [Created] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User
Peng Meng created SPARK-20443: - Summary: The blockSize of MLLIB ALS should be setting by the User Key: SPARK-20443 URL: https://issues.apache.org/jira/browse/SPARK-20443 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.3.0 Reporter: Peng Meng Priority: Minor The blockSize of MLLIB ALS is very important for ALS performance. In our test, when the blockSize is 128, the performance is about 4X compared with a blockSize of 4096 (the default value). The following are our test results: BlockSize(recommendationForAll time) 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM) The Test Environment: 3 workers: each worker 10 cores, 30G memory, 1 executor. The Data: User 480,000, and Item 17,000
[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15608322#comment-15608322 ] Peng Meng commented on SPARK-18088: --- I am neutral on changing the selectorType "kbest" to "numTopFeatures". "kbest", "percentile", and "fpr" are the same as the selector names in sklearn. Now, the selectorType and parameter pairs are: kbest=>numTopFeatures, percentile=>percentile, fpr=>alpha. After the change: numTopFeatures=>numTopFeatures, percentile=>percentile, fpr=>alpha. The sklearn pairs are: kbest=>k, percentile=>percentile, fpr=>alpha. Changing the parameter numTopFeatures to k may also be an option. Hi [~srowen], what's your idea about this change? > ChiSqSelector FPR PR cleanups > - > > Key: SPARK-18088 > URL: https://issues.apache.org/jira/browse/SPARK-18088 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > There are several cleanups I'd like to make as a follow-up to the PRs from > [SPARK-17017]: > * Rename selectorType values to match corresponding Params > * Add Since tags where missing > * a few minor cleanups
[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605231#comment-15605231 ] Peng Meng commented on SPARK-18088: --- In the previous implementation, testing against only the statistic is not right, so I submitted https://issues.apache.org/jira/browse/SPARK-17870 to fix that bug. Testing against only the p-value is ok: 3 of the 5 feature selection methods in sklearn are based only on the p-value; the other two are based on the statistic. Because the degrees of freedom are the same when sklearn computes the chi-square value, it can use the statistic. > ChiSqSelector FPR PR cleanups > - > > Key: SPARK-18088 > URL: https://issues.apache.org/jira/browse/SPARK-18088 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > There are several cleanups I'd like to make as a follow-up to the PRs from > [SPARK-17017]: > * Rename selectorType values to match corresponding Params > * Add Since tags where missing > * a few minor cleanups > One major item: FPR is not implemented correctly. Testing against only the > p-value and not the test statistic does not really tell you anything. We > should follow sklearn, which allows a p-value threshold for any selection > method: > [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html] > * In this PR, I'm just going to remove FPR completely. We can add it back in > a follow-up PR.
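For reference, the FPR selection being discussed reduces to a per-feature p-value threshold; a minimal sketch (not the sklearn source, just the selection rule it applies):

```python
def select_fpr(p_values, alpha=0.05):
    """Keep the indices of features whose p-value is below alpha."""
    return [i for i, p in enumerate(p_values) if p < alpha]

# e.g. only features 0 and 2 pass an alpha of 0.05
assert select_fpr([0.01, 0.20, 0.03, 0.90], alpha=0.05) == [0, 2]
```

The rule is meaningful only when the p-values themselves are comparable across features, which is exactly why the degrees-of-freedom question matters.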
[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605211#comment-15605211 ] Peng Meng commented on SPARK-18088: --- Hi [~josephkb], I don't quite understand "Testing against only the p-value and not the test statistic does not really tell you anything." SelectFpr in sklearn is based only on the p-value.
[jira] [Commented] (SPARK-18062) ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return probabilities when given all-0 vector
[ https://issues.apache.org/jira/browse/SPARK-18062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600926#comment-15600926 ] Peng Meng commented on SPARK-18062: --- This relates to how to interpret an all-0 rawPrediction: are all classes impossible, or are all classes equally likely? If the latter, ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return a valid probability vector with the uniform distribution. > ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should > return probabilities when given all-0 vector > > > Key: SPARK-18062 > URL: https://issues.apache.org/jira/browse/SPARK-18062 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Trivial > > {{ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace}} returns > a vector of all-0 when given a rawPrediction vector of all-0. It should > return a valid probability vector with the uniform distribution. > Note: This will be a *behavior change* but it should be very minor and affect > few if any users. But we should note it in release notes.
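The uniform-distribution fallback under discussion can be sketched in a few lines (plain Python, not the Spark implementation):

```python
def normalize_to_probabilities(raw):
    """Normalize a non-negative rawPrediction vector to probabilities.

    An all-0 vector is read as 'all classes equally likely', so the
    uniform distribution is returned instead of an all-0 vector.
    """
    total = sum(raw)
    if total == 0.0:
        n = len(raw)
        return [1.0 / n] * n
    return [v / total for v in raw]

assert normalize_to_probabilities([0.0, 0.0, 0.0]) == [1/3, 1/3, 1/3]
assert normalize_to_probabilities([1.0, 3.0]) == [0.25, 0.75]
```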
[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567170#comment-15567170 ] Peng Meng commented on SPARK-17870: --- hi [~avulanov], the question here is not whether to use raw chi2 scores or p-values; the question is that if raw chi2 scores are used, the DoF should be the same. "chi2-test is used multiple times" is another problem. According to (http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html), "whenever a statistical test is used multiple times, then the probability of getting at least one error increases." This problem is partially solved by selecting the p-values corresponding to the family-wise error rate (SelectFwe, SPARK-17645). Thanks very much. Hi [~srowen], I totally agree with your comments. Because the DoF differs across Spark's chi-square values, we can use the p-values for Spark's SelectKBest and SelectPercentile. Thanks very much. I will submit a PR for this. > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Priority: Critical > > The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > The feature selection method ChiSquareSelector is based on the > ChiSquareTestResult.statistic (chi-square value): it selects the features with the > largest chi-square values. But the Degree of Freedom (df) of the chi-square value > differs across features in Statistics.chiSqTest(RDD), and with different df you > cannot select features based on the chi-square value. > Because of the wrong method to compute the chi-square value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > If selectKBest is used: feature 3 will be selected. > If selectFpr is used: features 1 and 2 will be selected. > This is strange. > I used scikit-learn to test the same data with the same parameters. > When selectKBest is used: feature 1 will be selected. > When selectFpr is used: features 1 and 2 will be selected. > This result makes sense, because the df of each feature in scikit-learn is the same. > I plan to submit a PR for this problem.
[jira] [Updated] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Meng updated SPARK-17870: -- Summary: ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong (was: ML/MLLIB: Statistics.chiSqTest(RDD) is wrong )
[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565315#comment-15565315 ] Peng Meng commented on SPARK-17870: --- https://github.com/apache/spark/pull/1484#issuecomment-51024568 Hi [~mengxr] and [~avulanov], what do you think about this JIRA?
[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565251#comment-15565251 ] Peng Meng commented on SPARK-17870: --- The scikit learn code is here: https://github.com/scikit-learn/scikit-learn/blob/412996f09b6756752dfd3736c306d46fca8f1aa1/sklearn/feature_selection/univariate_selection.py, line 422 for selectKBest, chiSquare compute is also on the same page. For the last example, each row of X is a sample, it contain three features, totally 4 samples. Y is the label. Thanks very much. > ML/MLLIB: Statistics.chiSqTest(RDD) is wrong > - > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Priority: Critical > > The method to count ChiSqureTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > For feature selection method ChiSquareSelector, it is based on the > ChiSquareTestResult.statistic (ChiSqure value) to select the features. It > select the features with the largest ChiSqure value. But the Degree of > Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and > for different df, you cannot base on ChiSqure value to select features. > Because of the wrong method to count ChiSquare value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > If use selectKBest to select: the feature 3 will be selected. > If use selectFpr to select: feature 1 and 2 will be selected. > This is strange. > I use scikit learn to test the same data with the same parameters. > When use selectKBest to select: feature 1 will be selected. > When use selectFpr to select: feature 1 and 2 will be selected. > This result is make sense. because the df of each feature in scikit learn is > the same. > I plan to submit a PR for this problem. 
> > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
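For context, SelectKBest-style selection simply ranks features by their score and keeps the top k. A minimal pure-Python sketch of that idea (the function name `select_k_best` is mine, not scikit-learn's actual API):

```python
def select_k_best(scores, k):
    """Return the indices of the k features with the largest scores,
    in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:k])

# Chi-square scores for three features; feature 0 has the largest score.
print(select_k_best([16.0, 0.353, 6.895], k=2))  # -> [0, 2]
```

The key point of the thread: ranking on the raw statistic like this is only valid when every feature's statistic has the same degrees of freedom.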
[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565225#comment-15565225 ] Peng Meng commented on SPARK-17870: --- Yes, SelectKBest and SelectPercentile in scikit-learn use only the statistic. Because the method used to compute the chi-square value is different, the degrees of freedom (DoF) of all features in scikit-learn are the same, so it can do that. The chi-square computation works like this. Suppose we have the data

X = [[8, 7, 0],
     [0, 9, 6],
     [0, 9, 8],
     [8, 9, 5]]

y = [0, 1, 1, 2]

(this is the test-suite data of ml/feature/ChiSqSelectorSuite.scala). scikit-learn computes the chi-square value like this: first, one-hot encode the label,

Y = [[1, 0, 0],
     [0, 1, 0],
     [0, 1, 0],
     [0, 0, 1]]

observed = Y' * X = [[8,  7,  0],
                     [0, 18, 14],
                     [8,  9,  5]]

expected = [[4, 8.5, 4.75],
            [8, 17,  9.5],
            [4, 8.5, 4.75]]

Then _chisquare(observed, expected) computes the chi-square values of all features, and we can see that the DoF of every feature is the same. But Spark's Statistics.chiSqTest(RDD) uses another method: it constructs a contingency table for each feature, so the DoF differs per feature. As for "gives different results from ranking on the statistic", this is because the parameters differ: for the previous example, SelectKBest(2) selects the same features as SelectFpr(0.2) in scikit-learn. > ML/MLLIB: Statistics.chiSqTest(RDD) is wrong > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
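The computation described in the comment above can be reproduced in a few lines of plain Python. This is a sketch of scikit-learn's approach (not its actual code): observed counts are the per-class sums of each feature, expected counts are class probability times the feature's total sum, so every feature ends up with the same degrees of freedom (n_classes - 1).

```python
def chi2_scores(X, y):
    """Chi-square statistic per feature, scikit-learn style:
    observed[c][f] = sum of feature f over samples with class c,
    expected[c][f] = P(class c) * total sum of feature f."""
    classes = sorted(set(y))
    n_samples, n_features = len(X), len(X[0])
    observed = [[0.0] * n_features for _ in classes]
    for row, label in zip(X, y):
        c = classes.index(label)
        for f, v in enumerate(row):
            observed[c][f] += v
    feature_sums = [sum(row[f] for row in X) for f in range(n_features)]
    class_probs = [y.count(c) / n_samples for c in classes]
    scores = []
    for f in range(n_features):
        s = 0.0
        for c in range(len(classes)):
            expected = class_probs[c] * feature_sums[f]
            s += (observed[c][f] - expected) ** 2 / expected
        scores.append(s)
    return scores

# The ChiSqSelectorSuite data from the comment above.
X = [[8, 7, 0], [0, 9, 6], [0, 9, 8], [8, 9, 5]]
y = [0, 1, 1, 2]
print(chi2_scores(X, y))  # feature 0 scores highest (16.0), so SelectKBest(1) picks it
```

With this method all three statistics share the same DoF, so ranking them directly is legitimate; with a per-feature contingency table (Spark's method), it is not.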
[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565041#comment-15565041 ] Peng Meng commented on SPARK-17870: --- Hi [~srowen], thanks very much for your quick reply. Yes, the p-value is better than the raw statistic in this case, because the p-value is computed from the DoF and the raw statistic. The raw statistic is also popular for feature selection: SelectKBest and SelectPercentile in scikit-learn are based on the raw statistic. The question here is whether we should use the same DoF as scikit-learn when computing the chi-square value. For this JIRA, I propose to change the method of computing the chi-square value to match what is done in scikit-learn (i.e., change Statistics.chiSqTest(RDD)). Thanks very much. > ML/MLLIB: Statistics.chiSqTest(RDD) is wrong > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong
Peng Meng created SPARK-17870: - Summary: ML/MLLIB: Statistics.chiSqTest(RDD) is wrong Key: SPARK-17870 URL: https://issues.apache.org/jira/browse/SPARK-17870 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Peng Meng Priority: Critical The method used to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala (line 233) is wrong. The feature selection method ChiSquareSelector is based on ChiSquareTestResult.statistic (the chi-square value): it selects the features with the largest chi-square values. But the degrees of freedom (df) of the chi-square values differ across features in Statistics.chiSqTest(RDD), and with different df you cannot rank features by their chi-square values. Because of this wrong method of computing the chi-square value, the feature selection results are strange. Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: with selectKBest, feature 3 is selected; with selectFpr, features 1 and 2 are selected. This is strange. I used scikit-learn to test the same data with the same parameters: with selectKBest, feature 1 is selected; with selectFpr, features 1 and 2 are selected. This result makes sense, because the df of each feature in scikit-learn is the same. I plan to submit a PR for this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17645) Add feature selector methods based on: False Discovery Rate (FDR) and Family Wise Error rate (FWE)
Peng Meng created SPARK-17645: - Summary: Add feature selector methods based on: False Discovery Rate (FDR) and Family Wise Error rate (FWE) Key: SPARK-17645 URL: https://issues.apache.org/jira/browse/SPARK-17645 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Peng Meng Priority: Minor Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are popular univariate statistical tests for feature selection. In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR selector uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate. In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypothesis tests. https://en.wikipedia.org/wiki/Family-wise_error_rate We add FDR and FWE methods for ChiSqSelector in this PR, as they are implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
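The Benjamini-Hochberg step behind an FDR selector can be sketched as follows. This is a hypothetical helper illustrating the procedure, not Spark's or scikit-learn's actual implementation: sort the p-values ascending, find the largest rank k with p_(k) <= alpha * k / m, and keep every feature whose p-value is at or below that threshold.

```python
def select_fdr(p_values, alpha):
    """Benjamini-Hochberg procedure: return the indices of the features
    that survive FDR control at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    threshold = -1.0  # no feature selected unless some p-value qualifies
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            threshold = p_values[i]
    return sorted(i for i in range(m) if p_values[i] <= threshold)

print(select_fdr([0.01, 0.5, 0.03, 0.02], alpha=0.1))  # -> [0, 2, 3]
```

An FWE selector is simpler still: the Bonferroni-style variant keeps features with p <= alpha / m.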
[jira] [Created] (SPARK-17505) Add setBins for BinaryClassificationMetrics in mllib/evaluation
Peng Meng created SPARK-17505: - Summary: Add setBins for BinaryClassificationMetrics in mllib/evaluation Key: SPARK-17505 URL: https://issues.apache.org/jira/browse/SPARK-17505 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Peng Meng Priority: Minor Add a setBins method for BinaryClassificationMetrics. BinaryClassificationMetrics is a class in mllib/evaluation, and numBins is a key attribute of it: if numBins is greater than 0, the curves (ROC curve, PR curve) computed internally will be down-sampled to this many "bins". It is useful to let the user set numBins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
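To illustrate what down-sampling a curve to numBins means, here is a rough sketch of the idea (mine, not the actual BinaryClassificationMetrics code): consecutive (x, y) points of the curve are grouped into roughly numBins chunks, and each chunk is replaced by its average point.

```python
def downsample(points, num_bins):
    """Group consecutive (x, y) points into num_bins chunks and replace
    each chunk by its average point. If num_bins <= 0, keep all points."""
    if num_bins <= 0 or len(points) <= num_bins:
        return list(points)
    chunk = -(-len(points) // num_bins)  # ceiling division: points per bin
    out = []
    for start in range(0, len(points), chunk):
        group = points[start:start + chunk]
        out.append((sum(x for x, _ in group) / len(group),
                    sum(y for _, y in group) / len(group)))
    return out

curve = [(i / 10, i / 10) for i in range(10)]  # 10 points on the diagonal
print(downsample(curve, 5))  # 5 averaged points
```

This keeps the curve's shape while bounding the number of points, which matters when the curve has one point per distinct score threshold.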
[jira] [Commented] (SPARK-17462) Check for places within MLlib which should use VersionUtils to parse Spark version strings
[ https://issues.apache.org/jira/browse/SPARK-17462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483336#comment-15483336 ] Peng Meng commented on SPARK-17462: --- hi [~josephkb], I am busy these days; I am glad VinceShieh can help on this. > Check for places within MLlib which should use VersionUtils to parse Spark > version strings > -- > > Key: SPARK-17462 > URL: https://issues.apache.org/jira/browse/SPARK-17462 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > [SPARK-17456] creates a utility for parsing Spark versions. Several places > in MLlib use custom regexes or other approaches. Those should be fixed to > use the new (tested) utility. This task is for checking and replacing those > instances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17462) Check for places within MLlib which should use VersionUtils to parse Spark version strings
[ https://issues.apache.org/jira/browse/SPARK-17462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477019#comment-15477019 ] Peng Meng commented on SPARK-17462: --- Hi [~josephkb], will you work on this? If not, I can work on it. Thanks. > Check for places within MLlib which should use VersionUtils to parse Spark > version strings > -- > > Key: SPARK-17462 > URL: https://issues.apache.org/jira/browse/SPARK-17462 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > [SPARK-17456] creates a utility for parsing Spark versions. Several places > in MLlib use custom regexes or other approaches. Those should be fixed to > use the new (tested) utility. This task is for checking and replacing those > instances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475860#comment-15475860 ] Peng Meng commented on SPARK-6160: -- Hi [~GayathriMurali], are you still working on this? If not, I can work on it. Thanks. > ChiSqSelector should keep test statistic info > - > > Key: SPARK-6160 > URL: https://issues.apache.org/jira/browse/SPARK-6160 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > It is useful to have the test statistics explaining selected features, but > these data are thrown out when constructing the ChiSqSelectorModel. The data > are expensive to recompute, so the ChiSqSelectorModel should store and expose > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
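The gist of the SPARK-6160 request is that the fitted model should carry the test statistics alongside the selected indices, so they need not be recomputed. A hypothetical sketch in Python (class and attribute names are mine, not Spark's API; the statistic and p-value numbers are illustrative):

```python
class ChiSqSelectorModelSketch:
    """A fitted selector that keeps the per-feature test statistics and
    p-values computed during fitting, so callers can inspect them later."""
    def __init__(self, selected_features, statistics, p_values):
        self.selected_features = list(selected_features)
        self.statistics = list(statistics)   # chi-square value per feature
        self.p_values = list(p_values)       # p-value per feature

    def transform(self, row):
        """Keep only the selected features of one dense row."""
        return [row[i] for i in self.selected_features]

model = ChiSqSelectorModelSketch([0, 2], [16.0, 0.35, 6.89], [0.0003, 0.84, 0.03])
print(model.transform([8, 7, 0]))  # -> [8, 0]
print(model.statistics[0])         # the statistics survive fitting
```

Storing the statistics at fit time costs one small array per feature but saves a full pass over the data whenever a caller wants to explain the selection.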
[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info
[ https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475769#comment-15475769 ] Peng Meng commented on SPARK-6160: -- Hi [~josephkb], I have had some discussion with [~srowen] about keeping the test statistic info of ChiSqSelector in PR https://github.com/apache/spark/pull/14597. Could you review that PR and this commit: https://github.com/apache/spark/pull/14597/commits/3d6aecb8441503c9c3d62a2d8a3d48824b9d6637 > ChiSqSelector should keep test statistic info > - > > Key: SPARK-6160 > URL: https://issues.apache.org/jira/browse/SPARK-6160 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > It is useful to have the test statistics explaining selected features, but > these data are thrown out when constructing the ChiSqSelectorModel. The data > are expensive to recompute, so the ChiSqSelectorModel should store and expose > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org