[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-24 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217979#comment-16217979
 ] 

Peng Meng commented on SPARK-22277:
---

For problems 1 and 2, could you please post the test code?
For problem 1, one possible case is that all of the features have the same 
ChiSquare statistic value; then no matter which features you select, the result 
is right. 
The code would be helpful for analyzing the problem.
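
For the tie case, a minimal check sketch with ChiSquareTest (available in 
pyspark.ml.stat since Spark 2.2; assumes the df, "features", and "clicked" 
setup from the code quoted below):

{code:python}
from pyspark.ml.stat import ChiSquareTest

row = ChiSquareTest.test(df, "features", "clicked").head()
# If all statistics tie, any top-k feature selection is equally "right".
print(row.statistics)
{code}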

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior when the ChiSquare selector is used vs. 
> using the features directly in the decision tree classifier. 
> In the code below, I have used the ChiSquare selector as a pass-through, but 
> the decision tree classifier is unable to process its output, while it is 
> able to process the features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> import sys
> df = spark.createDataFrame([
>     (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
>     (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
>     (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)],
>     ["id", "features", "clicked"])
> # ChiSq selector will just be a pass-through: all four features in the input
> # will also be in the output.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>                          outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected"
>       % selector.getNumTopFeatures())
> try:
>     # Fails
>     dt = DecisionTreeClassifier(labelCol="clicked",
>                                 featuresCol="selectedFeatures")
>     model = dt.fit(result)
> except:
>     print(sys.exc_info())
> # Works
> dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using the same dataset, not splitting!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute the test error.
> evaluator = MulticlassClassificationEvaluator(
>     labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
> org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
> sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x...>)
> +----------+-------+------------------+
> |prediction|clicked|          features|

[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-18 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209166#comment-16209166
 ] 

Peng Meng commented on SPARK-22295:
---

Why is there a leading space in the column names: ' plas', ' pres', ' skin', 
' test', ' mass', ' pedi', ' age'?

> Chi Square selector not recognizing field in Data frame
> ---
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> The ChiSquare selector is not recognizing the field 'class', which is present 
> in the data frame, while fitting the model. I am using the PIMA Indians 
> diabetes dataset from UCI. The complete code and output are below for 
> reference. However, when some rows of the input file are turned into a data 
> frame manually, it works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name = 'data/pima-indians-diabetes.data'
> df = spark.read.format("csv").option("inferSchema", "true") \
>     .option("header", "true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>                                        ' test', ' mass', ' pedi', ' age'],
>                             outputCol="features")
> df = assembler.transform(df)
> df.show(1)
> try:
>     css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                         outputCol="selected", labelCol='class').fit(df)
> except:
>     print(sys.exc_info())
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|            features|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> only showing top 1 row
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, 
> IllegalArgumentException('Field "class" does not exist.', 
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
>  at 
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
>  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t 
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x...>)
> *The below code works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name = 'data/pima-indians-diabetes.data'
> #df = spark.read.format("csv").option("inferSchema", "true") \
> #    .option("header", "true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This
> # will work, but not the frame picked up from the file.
> df = spark.createDataFrame([
>     [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
>     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
>     [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age',
>     'class'])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>                                        ' test', ' mass', ' pedi', ' age'],
>                             outputCol="features")
> df = assembler.transform(df)
> df.show(1)
> try:
>     css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                         outputCol="selected", labelCol="class").fit(df)
> except:
>     print(sys.exc_info())
> {code}

[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-18 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209164#comment-16209164
 ] 

Peng Meng commented on SPARK-22295:
---

Please use labelCol=" class", not labelCol="class".
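
A minimal workaround sketch, assuming the same CSV file as in the quoted code: 
trim the header names once after loading, so that plain names such as 'class' 
resolve.

{code:python}
df = spark.read.format("csv").option("inferSchema", "true") \
    .option("header", "true").load('data/pima-indians-diabetes.data')
# Strip the stray spaces from every column name: ' plas' -> 'plas', etc.
df = df.toDF(*[c.strip() for c in df.columns])
{code}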

> Chi Square selector not recognizing field in Data frame
> ---
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> The ChiSquare selector is not recognizing the field 'class', which is present 
> in the data frame, while fitting the model. I am using the PIMA Indians 
> diabetes dataset from UCI. The complete code and output are below for 
> reference. However, when some rows of the input file are turned into a data 
> frame manually, it works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name = 'data/pima-indians-diabetes.data'
> df = spark.read.format("csv").option("inferSchema", "true") \
>     .option("header", "true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>                                        ' test', ' mass', ' pedi', ' age'],
>                             outputCol="features")
> df = assembler.transform(df)
> df.show(1)
> try:
>     css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                         outputCol="selected", labelCol='class').fit(df)
> except:
>     print(sys.exc_info())
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|            features|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> only showing top 1 row
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, 
> IllegalArgumentException('Field "class" does not exist.', 
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
>  at 
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
>  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t 
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x...>)
> *The below code works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name = 'data/pima-indians-diabetes.data'
> #df = spark.read.format("csv").option("inferSchema", "true") \
> #    .option("header", "true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This
> # will work, but not the frame picked up from the file.
> df = spark.createDataFrame([
>     [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
>     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
>     [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age',
>     'class'])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>                                        ' test', ' mass', ' pedi', ' age'],
>                             outputCol="features")
> df = assembler.transform(df)
> df.show(1)
> try:
>     css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                         outputCol="selected", labelCol="class").fit(df)
> except:
>     print(sys.exc_info())
> {code}

[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame

2017-10-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208708#comment-16208708
 ] 

Peng Meng commented on SPARK-22295:
---

Hi [~cheburakshu], thanks for reporting this bug and for the helpful code. 
This is caused by a similar problem, but it is not the same thing as SPARK-22277. 
The reason is that when a dataframe is transformed, the field/attribute metadata 
is not correctly set.

Maybe there are some other similar bugs in the code; we can solve them 
separately, or solve them together.

[~yanboliang] [~mlnick] [~srowen]
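
A small sketch for inspecting the attribute metadata in question (assumes the 
assembled df from the quoted code below; the "ml_attr" entry is where the 
per-feature ML attributes live):

{code:python}
# If the attributes recorded here are wrong or incomplete, downstream stages
# that rely on them will fail or misbehave.
print(df.schema["features"].metadata)
{code}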

> Chi Square selector not recognizing field in Data frame
> ---
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> The ChiSquare selector is not recognizing the field 'class', which is present 
> in the data frame, while fitting the model. I am using the PIMA Indians 
> diabetes dataset from UCI. The complete code and output are below for 
> reference. However, when some rows of the input file are turned into a data 
> frame manually, it works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name = 'data/pima-indians-diabetes.data'
> df = spark.read.format("csv").option("inferSchema", "true") \
>     .option("header", "true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>                                        ' test', ' mass', ' pedi', ' age'],
>                             outputCol="features")
> df = assembler.transform(df)
> df.show(1)
> try:
>     css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                         outputCol="selected", labelCol='class').fit(df)
> except:
>     print(sys.exc_info())
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|            features|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> only showing top 1 row
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, 
> IllegalArgumentException('Field "class" does not exist.', 
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at 
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at 
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at 
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
>  at 
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
>  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t 
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x...>)
> *The below code works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name = 'data/pima-indians-diabetes.data'
> #df = spark.read.format("csv").option("inferSchema", "true") \
> #    .option("header", "true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This
> # will work, but not the frame picked up from the file.
> df = spark.createDataFrame([
>     [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
>     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
>     [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age',
>     'class'])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>                                        ' test', ' mass', ' pedi', ' age'],
>                             outputCol="features")
> df = assembler.transform(df)
> df.show(1)
> try:
>     css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
>                         outputCol="selected", labelCol="class").fit(df)
> except:
>     print(sys.exc_info())
> {code}

[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

2017-10-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207146#comment-16207146
 ] 

Peng Meng commented on SPARK-22277:
---

This seems to be a bug. If no one is working on it, I can work on it.

> Chi Square selector garbling Vector content.
> 
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Cheburakshu
>
> There is a difference in behavior when the ChiSquare selector is used vs. 
> using the features directly in the decision tree classifier. 
> In the code below, I have used the ChiSquare selector as a pass-through, but 
> the decision tree classifier is unable to process its output, while it is 
> able to process the features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> import sys
> df = spark.createDataFrame([
>     (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
>     (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
>     (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)],
>     ["id", "features", "clicked"])
> # ChiSq selector will just be a pass-through: all four features in the input
> # will also be in the output.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>                          outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected"
>       % selector.getNumTopFeatures())
> try:
>     # Fails
>     dt = DecisionTreeClassifier(labelCol="clicked",
>                                 featuresCol="selectedFeatures")
>     model = dt.fit(result)
> except:
>     print(sys.exc_info())
> # Works
> dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
> model = dt.fit(df)
> 
> # Make predictions. Using the same dataset, not splitting!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute the test error.
> evaluator = MulticlassClassificationEvaluator(
>     labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at 
> org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t 
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t 
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at 
> org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at 
> org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at 
> sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at 
> py4j.Gateway.invoke(Gateway.java:280)\n\t at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at 
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at 
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at 
> java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x...>)
> +----------+-------+------------------+
> |prediction|clicked|          features|
> +----------+-------+------------------+
> |       1.0|    1.0|[0.0,0.0,18.0,1.0]|
> |       0.0|    0.0|[0.0,1.0,12.0,0.0]|
> |       0.0|    0.0|[1.0,0.0,15.0,0.1]|
> +----------+-------+------------------+

[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector

2017-10-08 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196412#comment-16196412
 ] 

Peng Meng commented on SPARK-22115:
---

ok, thanks.

> Add operator for linalg Matrix and Vector
> -
>
> Key: SPARK-22115
> URL: https://issues.apache.org/jira/browse/SPARK-22115
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Peng Meng
>
> For example, there is a lot of code in LDA like this:
> {code:java}
> phiNorm := expElogbetad * expElogthetad +:+ 1e-100
> {code}
> expElogbetad is a breeze Matrix and expElogthetad is a breeze Vector. 
> This code calls BLAS GEMV and then loops over the result (for the :+ 1e-100).
> Actually, this can be done with GEMV alone, because the standard interface of 
> gemv is: 
> gemv(alpha, A, x, beta, y) // y := alpha*A*x + beta*y
> We can provide some operators (e.g. element-wise product (:*), element-wise 
> sum (:+)) for Spark linalg Matrix and Vector, and replace breeze Matrix and 
> Vector with Spark linalg Matrix and Vector. 
> Then for all cases like y = alpha*A*x + beta*y, we can call GEMM or GEMV 
> once, without calling GEMM or GEMV and then looping over the result (for the 
> add) as in the current implementation. 
> I can help to do it if we plan to add this feature.






[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector

2017-10-08 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196115#comment-16196115
 ] 

Peng Meng commented on SPARK-22115:
---

Hi [~mlnick], I am just back from a national holiday.
If it is only for LDA, it is OK to make this private. 
But I think it is better to expose a new public API; then we can extend this 
API and finally replace breeze Matrix and Vector.


> Add operator for linalg Matrix and Vector
> -
>
> Key: SPARK-22115
> URL: https://issues.apache.org/jira/browse/SPARK-22115
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Peng Meng
>
> For example, there is a lot of code in LDA like this:
> {code:java}
> phiNorm := expElogbetad * expElogthetad +:+ 1e-100
> {code}
> expElogbetad is a breeze Matrix and expElogthetad is a breeze Vector. 
> This code calls BLAS GEMV and then loops over the result (for the :+ 1e-100).
> Actually, this can be done with GEMV alone, because the standard interface of 
> gemv is: 
> gemv(alpha, A, x, beta, y) // y := alpha*A*x + beta*y
> We can provide some operators (e.g. element-wise product (:*), element-wise 
> sum (:+)) for Spark linalg Matrix and Vector, and replace breeze Matrix and 
> Vector with Spark linalg Matrix and Vector. 
> Then for all cases like y = alpha*A*x + beta*y, we can call GEMM or GEMV 
> once, without calling GEMM or GEMV and then looping over the result (for the 
> add) as in the current implementation. 
> I can help to do it if we plan to add this feature.






[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector

2017-09-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180292#comment-16180292
 ] 

Peng Meng commented on SPARK-22115:
---

Thanks [~srowen], I will begin to work on it.

> Add operator for linalg Matrix and Vector
> -
>
> Key: SPARK-22115
> URL: https://issues.apache.org/jira/browse/SPARK-22115
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Peng Meng
>
> For example, there is a lot of code in LDA like this:
> {code:java}
> phiNorm := expElogbetad * expElogthetad +:+ 1e-100
> {code}
> expElogbetad is a breeze Matrix and expElogthetad is a breeze Vector. 
> This code calls BLAS GEMV and then loops over the result (for the :+ 1e-100).
> Actually, this can be done with GEMV alone, because the standard interface of 
> gemv is: 
> gemv(alpha, A, x, beta, y) // y := alpha*A*x + beta*y
> We can provide some operators (e.g. element-wise product (:*), element-wise 
> sum (:+)) for Spark linalg Matrix and Vector, and replace breeze Matrix and 
> Vector with Spark linalg Matrix and Vector. 
> Then for all cases like y = alpha*A*x + beta*y, we can call GEMM or GEMV 
> once, without calling GEMM or GEMV and then looping over the result (for the 
> add) as in the current implementation. 
> I can help to do it if we plan to add this feature.






[jira] [Created] (SPARK-22115) Add operator for linalg Matrix and Vector

2017-09-25 Thread Peng Meng (JIRA)
Peng Meng created SPARK-22115:
-

 Summary: Add operator for linalg Matrix and Vector
 Key: SPARK-22115
 URL: https://issues.apache.org/jira/browse/SPARK-22115
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 3.0.0
Reporter: Peng Meng


For example, there is a lot of code in LDA like this:
{code:java}
phiNorm := expElogbetad * expElogthetad +:+ 1e-100
{code}
expElogbetad is a breeze Matrix and expElogthetad is a breeze Vector. 
This code calls BLAS GEMV and then loops over the result (for the :+ 1e-100).

Actually, this can be done with GEMV alone, because the standard interface of 
gemv is: 
gemv(alpha, A, x, beta, y) // y := alpha*A*x + beta*y

We can provide some operators (e.g. element-wise product (:*), element-wise sum 
(:+)) for Spark linalg Matrix and Vector, and replace breeze Matrix and Vector 
with Spark linalg Matrix and Vector. 

Then for all cases like y = alpha*A*x + beta*y, we can call GEMM or GEMV once, 
without calling GEMM or GEMV and then looping over the result (for the add) as 
in the current implementation. 

I can help to do it if we plan to add this feature.
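
For intuition, a minimal sketch of the fused form using SciPy's BLAS wrapper 
(illustrative only: the proposal targets Spark's Scala linalg, and the shapes 
here are made up):

{code:python}
import numpy as np
from scipy.linalg.blas import dgemv

A = np.asfortranarray(np.random.rand(4, 3))  # stands in for expElogbetad
x = np.random.rand(3)                        # stands in for expElogthetad
y = np.full(4, 1e-100)                       # the "+:+ 1e-100" term
# One GEMV call computes y := 1.0*A*x + 1.0*y, so no second loop is needed.
phiNorm = dgemv(1.0, A, x, beta=1.0, y=y)
assert np.allclose(phiNorm, A @ x + 1e-100)
{code}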









[jira] [Created] (SPARK-22114) The condition of OnlineLDAOptimizer convergence should be configurable

2017-09-25 Thread Peng Meng (JIRA)
Peng Meng created SPARK-22114:
-

 Summary: The condition of OnlineLDAOptimizer convergence should be 
configurable  
 Key: SPARK-22114
 URL: https://issues.apache.org/jira/browse/SPARK-22114
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Peng Meng
Priority: Minor


The current convergence check of OnlineLDAOptimizer is:
{code:java}
while (meanGammaChange > 1e-3)
{code}

This condition is critical for the performance and accuracy of LDA. 
We should make it configurable, as it is in Vowpal Wabbit: 
https://github.com/JohnLangford/vowpal_wabbit/blob/430f69453bc4876a39351fba1f18771bdbdb7122/vowpalwabbit/lda_core.cc
(line 638)
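
A hypothetical sketch of the shape of the change (the names, e.g. one_em_step, 
are illustrative, not Spark's actual internals):

{code:python}
def run_until_converged(gamma, one_em_step, tol=1e-3, max_iter=100):
    # tol replaces the hard-coded 1e-3 and becomes a user-settable parameter.
    for _ in range(max_iter):
        gamma, mean_gamma_change = one_em_step(gamma)
        if mean_gamma_change <= tol:
            break
    return gamma
{code}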






[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-09-06 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155018#comment-16155018
 ] 

Peng Meng commented on SPARK-21476:
---

With 50 instances per partition, broadcasting is a little faster than the 
current solution; with 500 instances per partition, it is a little slower.
In both cases the difference is about 10%.

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>Priority: Minor
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel, and the only concrete 
> definition that RandomForestClassificationModel provides, and which is 
> actually used in transform, is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-09-05 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154849#comment-16154849
 ] 

Peng Meng commented on SPARK-21476:
---

I think the performance difference between my test and yours comes from the 
number of records in each partition. 
In my test, I use:
model.transform(data)
There are many instances (e.g. 10k) in each partition, so the deserialization 
time of the model is not a big part of the end-to-end time.
In your case, maybe there are very few instances in each partition, so the 
deserialization time seems very long.
In my case, when using broadcast, the end-to-end time is longer than with the 
current solution.
We can share the test configurations and do more tests.
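
A hypothetical micro-benchmark sketch for reproducing this comparison (model is 
a fitted RandomForestClassificationModel and data is a DataFrame; both are 
assumptions, not from the report):

{code:python}
import time

def timed_transform(model, df):
    start = time.time()
    model.transform(df).count()  # count() forces the lazy transform to run
    return time.time() - start

# Vary the number of rows per partition: many small vs. a few large partitions.
print(timed_transform(model, data.repartition(2000)))
print(timed_transform(model, data.repartition(20)))
{code}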

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>Priority: Minor
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel, and the only concrete 
> definition that RandomForestClassificationModel provides, and which is 
> actually used in transform, is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-08-22 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136538#comment-16136538
 ] 

Peng Meng commented on SPARK-21476:
---

I don't think it can be argued that "in any case using broadcasting causes no 
harm". Actually, in this case, my tests show that with broadcasting the 
performance is a little worse than with the current solution.

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel, and the only concrete 
> definition that RandomForestClassificationModel provides, and which is 
> actually used in transform, is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121284#comment-16121284
 ] 

Peng Meng commented on SPARK-21688:
---

MKL is just one example of a native BLAS; if the user has OpenBLAS, ATLAS, and 
so on, those work as well.

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current mllib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings 
> native BLAS is generally better than f2j at the unit-test level. Here we make 
> the BLAS operations in SVM go through MKL BLAS and get an end-to-end 
> performance report showing that in most cases native BLAS outperforms f2j 
> BLAS by up to 50%.
> So, we suggest removing the f2j-fixed calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark the 
> other affected algorithms.






[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-10 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121175#comment-16121175
 ] 

Peng Meng commented on SPARK-21680:
---

I mean that if the user calls toSparse(size) but size is smaller than 
numNonzeros, there may be a problem.

> ML/MLLIB Vector compressed optimization
> ---
>
> Key: SPARK-21680
> URL: https://issues.apache.org/jira/browse/SPARK-21680
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> When using Vector.compressed to convert a Vector to a SparseVector, the 
> performance is very low compared with Vector.toSparse.
> This is because Vector.compressed has to scan the values three times, while 
> Vector.toSparse needs only two scans.
> When the length of the vector is large, there is a significant performance 
> difference between these two methods.
> Code of Vector.compressed:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>     // 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       toSparse
>     } else {
>       toDense
>     }
>   }
> {code}
> I propose to change it to:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>     // 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       val ii = new Array[Int](nnz)
>       val vv = new Array[Double](nnz)
>       var k = 0
>       foreachActive { (i, v) =>
>         if (v != 0) {
>           ii(k) = i
>           vv(k) = v
>           k += 1
>         }
>       }
>       new SparseVector(size, ii, vv)
>     } else {
>       toDense
>     }
>   }
> {code}






[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120947#comment-16120947
 ] 

Peng Meng commented on SPARK-21680:
---

Hi [~srowen], if we add toSparse(size), then for safety it is better to check 
size against numNonzeros: if size is smaller than numNonzeros, the program may 
crash. But if we check size against numNonzeros, we still add one more scan 
over the values. 

So in this PR, I revise the code as described in this JIRA.

Thanks. 

> ML/MLLIB Vector compressed optimization
> ---
>
> Key: SPARK-21680
> URL: https://issues.apache.org/jira/browse/SPARK-21680
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> When using Vector.compressed to convert a Vector to a SparseVector, the 
> performance is very low compared with Vector.toSparse.
> This is because Vector.compressed has to scan the values three times, while 
> Vector.toSparse needs only two scans.
> When the length of the vector is large, there is a significant performance 
> difference between these two methods.
> Code of Vector.compressed:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>     // 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       toSparse
>     } else {
>       toDense
>     }
>   }
> {code}
> I propose to change it to:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>     // 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       val ii = new Array[Int](nnz)
>       val vv = new Array[Double](nnz)
>       var k = 0
>       foreachActive { (i, v) =>
>         if (v != 0) {
>           ii(k) = i
>           vv(k) = v
>           k += 1
>         }
>       }
>       new SparseVector(size, ii, vv)
>     } else {
>       toDense
>     }
>   }
> {code}






[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120136#comment-16120136
 ] 

Peng Meng commented on SPARK-21680:
---

Ok, thanks, I will submit a PR.

> ML/MLLIB Vector compressed optimization
> ---
>
> Key: SPARK-21680
> URL: https://issues.apache.org/jira/browse/SPARK-21680
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> When using Vector.compressed to convert a Vector to a SparseVector, the 
> performance is very low compared with Vector.toSparse.
> This is because Vector.compressed has to scan the values three times, while 
> Vector.toSparse needs only two scans.
> When the length of the vector is large, there is a significant performance 
> difference between these two methods.
> Code of Vector.compressed:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>     // 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       toSparse
>     } else {
>       toDense
>     }
>   }
> {code}
> I propose to change it to:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>     // 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       val ii = new Array[Int](nnz)
>       val vv = new Array[Double](nnz)
>       var k = 0
>       foreachActive { (i, v) =>
>         if (v != 0) {
>           ii(k) = i
>           vv(k) = v
>           k += 1
>         }
>       }
>       new SparseVector(size, ii, vv)
>     } else {
>       toDense
>     }
>   }
> {code}






[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120115#comment-16120115
 ] 

Peng Meng commented on SPARK-21680:
---

Then we will have two versions of toSparse:
toSparse
and 
toSparse(size)
Is that what you mean? Thanks.

> ML/MLLIB Vector compressed optimization
> ---
>
> Key: SPARK-21680
> URL: https://issues.apache.org/jira/browse/SPARK-21680
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> When using Vector.compressed to convert a Vector to a SparseVector, the 
> performance is very low compared with Vector.toSparse.
> This is because Vector.compressed has to scan the values three times, while 
> Vector.toSparse needs only two scans.
> When the length of the vector is large, there is a significant performance 
> difference between these two methods.
> Code of Vector.compressed:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>     // 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       toSparse
>     } else {
>       toDense
>     }
>   }
> {code}
> I propose to change it to:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>     // 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       val ii = new Array[Int](nnz)
>       val vv = new Array[Double](nnz)
>       var k = 0
>       foreachActive { (i, v) =>
>         if (v != 0) {
>           ii(k) = i
>           vv(k) = v
>           k += 1
>         }
>       }
>       new SparseVector(size, ii, vv)
>     } else {
>       toDense
>     }
>   }
> {code}






[jira] [Commented] (SPARK-21624) Optimize communication cost of RF/GBT/DT

2017-08-09 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120105#comment-16120105
 ] 

Peng Meng commented on SPARK-21624:
---

Hi [~mlnick], what do you think about this: 
https://issues.apache.org/jira/browse/SPARK-21680

Thanks.

> Optimize communication cost of RF/GBT/DT
> 
>
> Key: SPARK-21624
> URL: https://issues.apache.org/jira/browse/SPARK-21624
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> {quote}The implementation of RF is bound by either the cost of statistics 
> computation on workers or by communicating the sufficient statistics.{quote}
> The statistics are stored in allStats:
> {code:java}
>   /**
>    * Flat array of elements.
>    * Index for start of stats for a (feature, bin) is:
>    *   index = featureOffsets(featureIndex) + binIndex * statsSize
>    */
>   private var allStats: Array[Double] = new Array[Double](allStatsSize)
> {code}
> The size of allStats may be very large, and it can be very sparse, especially 
> for nodes near the leaves of the tree. 
> I have changed allStats from an Array to a SparseVector; my tests show the 
> communication cost is down by about 50%.






[jira] [Created] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-09 Thread Peng Meng (JIRA)
Peng Meng created SPARK-21680:
-

 Summary: ML/MLLIB Vector compressed optimization
 Key: SPARK-21680
 URL: https://issues.apache.org/jira/browse/SPARK-21680
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Peng Meng


When using Vector.compressed to convert a Vector to a SparseVector, the 
performance is very low compared with Vector.toSparse.
This is because Vector.compressed has to scan the values three times, while 
Vector.toSparse needs only two scans.
When the length of the vector is large, there is a significant performance 
difference between these two methods.
Code of Vector.compressed:
{code:java}
  def compressed: Vector = {
    val nnz = numNonzeros
    // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
    // 12 * nnz + 20 bytes.
    if (1.5 * (nnz + 1.0) < size) {
      toSparse
    } else {
      toDense
    }
  }
{code}

I propose to change it to:

{code:java}
  def compressed: Vector = {
    val nnz = numNonzeros
    // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
    // 12 * nnz + 20 bytes.
    if (1.5 * (nnz + 1.0) < size) {
      val ii = new Array[Int](nnz)
      val vv = new Array[Double](nnz)
      var k = 0
      foreachActive { (i, v) =>
        if (v != 0) {
          ii(k) = i
          vv(k) = v
          k += 1
        }
      }
      new SparseVector(size, ii, vv)
    } else {
      toDense
    }
  }
{code}
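
For intuition, a minimal sketch of the two-scan idea in plain numpy 
(illustrative only: the function name and tuple return are assumptions, and the 
real change is in Spark's Scala Vector code):

{code:python}
import numpy as np

def compressed(values):
    nnz = np.count_nonzero(values)      # scan 1: count the non-zeros
    if 1.5 * (nnz + 1.0) < values.size:
        ii = np.flatnonzero(values)     # scan 2: indices of the non-zeros
        vv = values[ii]                 # gather the values in the same pass
        return ("sparse", values.size, ii, vv)
    return ("dense", values)

print(compressed(np.array([0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0])))
{code}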








[jira] [Updated] (SPARK-21638) Warning message of RF is not accurate

2017-08-07 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-21638:
--
Description: 
When training an RF model, there are many warning messages like this:
{quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes 
per iteration, which exceeds requested limit maxMemoryUsage=268435456. This 
allows splitting 2622 nodes in this iteration.{quote}
This warning message is unnecessary and the number is not accurate.

Actually, whenever not all of the queued nodes can be split in one iteration, 
this warning is shown. In most cases not all of the nodes can be split in a 
single iteration, so in most cases this warning is shown on every iteration.

This is because:
{code:java}
while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) {
    Some(SamplingUtils.reservoirSampleAndCount(Range(0,
      metadata.numFeatures).iterator, metadata.numFeaturesPerNode,
      rng.nextLong())._1)
  } else {
    None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage =
    RandomForest.aggregateSizeForNode(metadata, featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
    nodeStack.pop()
    mutableNodesForGroup.getOrElseUpdate(treeIndex,
      new mutable.ArrayBuffer[LearningNode]()) += node
    mutableTreeToNodeToIndexInfo
      .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, NodeIndexInfo]())(node.id)
      = new NodeIndexInfo(numNodesInGroup, featureSubset)
  }
  // We did not add the node to mutableNodesForGroup, but we still add
  // memUsage here.
  numNodesInGroup += 1
  memUsage += nodeMemUsage
}
if (memUsage > maxMemoryUsage) {
  // If maxMemoryUsage is 0, we should still allow splitting 1 node.
  logWarning(s"Tree learning is using approximately $memUsage bytes per iteration, which" +
    s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This allows splitting" +
    s" $numNodesInGroup nodes in this iteration.")
}
{code}



  was:
When train RF model, there is many warning message like this:
{quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes 
per iteration, which exceeds requested limit maxMemoryUsage=268435456. This 
allows splitting 2622 nodes in this iteration.{quote}
This warning message is unnecessary and the data is not accurate.
This is because

{code:java}
while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.pop()
mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
mutable.ArrayBuffer[LearningNode]()) +=
  node
mutableTreeToNodeToIndexInfo
  .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
  = new NodeIndexInfo(numNodesInGroup, featureSubset)
  }
  numNodesInGroup += 1  *//we not add the node to mutableNodesForGroup, but 
we add memUsage here.*
  memUsage += nodeMemUsage
}
if (memUsage > maxMemoryUsage) {
  // If maxMemoryUsage is 0, we should still allow splitting 1 node.
  logWarning(s"Tree learning is using approximately $memUsage bytes per 
iteration, which" +
s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This allows 
splitting" +
s" $numNodesInGroup nodes in this iteration.")
}
{code}

To avoid this unnecessary warning, we should change the code like this:

{code:java}
while (nodeStack.nonEmpty) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.p

[jira] [Updated] (SPARK-21624) Optimize communication cost of RF/GBT/DT

2017-08-06 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-21624:
--
Description: 
{quote}The implementation of RF is bound by either the cost of statistics 
computation on workers or by communicating the sufficient statistics.{quote}

The statistics are stored in allStats:

{code:java}
  /**
   * Flat array of elements.
   * Index for start of stats for a (feature, bin) is:
   *   index = featureOffsets(featureIndex) + binIndex * statsSize
   */
  private var allStats: Array[Double] = new Array[Double](allStatsSize)
{code}
The size of allStats may be very large, and it can be very sparse, especially 
for nodes near the leaves of the tree. 

I have changed allStats from an Array to a SparseVector; my tests show the 
communication cost is down by about 50%.

  was:
{quote}The implementation of RF is bound by either  the cost of statistics 
computation on workers or by communicating the sufficient statistics.{quote}

The statistics are stored in allStats:

{code:java}
  /**
   * Flat array of elements.
   * Index for start of stats for a (feature, bin) is:
   *   index = featureOffsets(featureIndex) + binIndex * statsSize
   */
  private var allStats: Array[Double] = new Array[Double](allStatsSize)
{code}
The size of allStats maybe very large, and it can be very spare, especially on 
the nodes that near the leave of the tree. 

I have changed allStats from Array to SparseVector,  my tests show the 
communication is down by about 50%.



> Optimize communication cost of RF/GBT/DT
> 
>
> Key: SPARK-21624
> URL: https://issues.apache.org/jira/browse/SPARK-21624
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> {quote}The implementation of RF is bound by either the cost of statistics 
> computation on workers or by communicating the sufficient statistics.{quote}
> The statistics are stored in allStats:
> {code:java}
>   /**
>    * Flat array of elements.
>    * Index for start of stats for a (feature, bin) is:
>    *   index = featureOffsets(featureIndex) + binIndex * statsSize
>    */
>   private var allStats: Array[Double] = new Array[Double](allStatsSize)
> {code}
> The size of allStats may be very large, and it can be very sparse, especially 
> for nodes near the leaves of the tree. 
> I have changed allStats from an Array to a SparseVector; my tests show the 
> communication cost is down by about 50%.






[jira] [Commented] (SPARK-21638) Warning message of RF is not accurate

2017-08-04 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114156#comment-16114156
 ] 

Peng Meng commented on SPARK-21638:
---

The first number in the message should have nodeMemUsage subtracted: the loop 
charges the last, never-added node to memUsage, so the accurate figure is 
memUsage - nodeMemUsage.

> Warning message of RF is not accurate
> -
>
> Key: SPARK-21638
> URL: https://issues.apache.org/jira/browse/SPARK-21638
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: 
>Reporter: Peng Meng
>Priority: Minor
>
> When training an RF model, there are many warning messages like this:
> {quote}WARN RandomForest: Tree learning is using approximately 268492800 
> bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. 
> This allows splitting 2622 nodes in this iteration.{quote}
> This warning message is unnecessary and the data is not accurate.
> This is because
> {code:java}
> while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
>   }
>   numNodesInGroup += 1  *// we do not add the node to mutableNodesForGroup, 
> but we do add memUsage here.*
>   memUsage += nodeMemUsage
> }
> if (memUsage > maxMemoryUsage) {
>   // If maxMemoryUsage is 0, we should still allow splitting 1 node.
>   logWarning(s"Tree learning is using approximately $memUsage bytes per 
> iteration, which" +
> s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This 
> allows splitting" +
> s" $numNodesInGroup nodes in this iteration.")
> }
> {code}
> To avoid this unnecessary warning, we should change the code like this:
> {code:java}
> while (nodeStack.nonEmpty) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
> numNodesInGroup += 1 
> memUsage += nodeMemUsage
>   } else {
>     break
>   }
> }
> {code}






[jira] [Commented] (SPARK-21638) Warning message of RF is not accurate

2017-08-04 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114153#comment-16114153
 ] 

Peng Meng commented on SPARK-21638:
---

In the example warning message, the number of nodes to split should be 2621.

> Warning message of RF is not accurate
> -
>
> Key: SPARK-21638
> URL: https://issues.apache.org/jira/browse/SPARK-21638
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: 
>Reporter: Peng Meng
>Priority: Minor
>
> When training an RF model, there are many warning messages like this:
> {quote}WARN RandomForest: Tree learning is using approximately 268492800 
> bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. 
> This allows splitting 2622 nodes in this iteration.{quote}
> This warning message is unnecessary and the data is not accurate.
> This is because
> {code:java}
> while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
>   }
>   numNodesInGroup += 1  *// we do not add the node to mutableNodesForGroup, 
> but we do add memUsage here.*
>   memUsage += nodeMemUsage
> }
> if (memUsage > maxMemoryUsage) {
>   // If maxMemoryUsage is 0, we should still allow splitting 1 node.
>   logWarning(s"Tree learning is using approximately $memUsage bytes per 
> iteration, which" +
> s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This 
> allows splitting" +
> s" $numNodesInGroup nodes in this iteration.")
> }
> {code}
> To avoid this unnecessary warning, we should change the code like this:
> {code:java}
> while (nodeStack.nonEmpty) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
> numNodesInGroup += 1 
> memUsage += nodeMemUsage
>   } else {
>     break
>   }
> }
> {code}






[jira] [Commented] (SPARK-21638) Warning message of RF is not accurate

2017-08-04 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114133#comment-16114133
 ] 

Peng Meng commented on SPARK-21638:
---

I am heading home now; I will answer your question next week. Thanks [~srowen]

> Warning message of RF is not accurate
> -
>
> Key: SPARK-21638
> URL: https://issues.apache.org/jira/browse/SPARK-21638
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: 
>Reporter: Peng Meng
>Priority: Minor
>
> When training an RF model, there are many warning messages like this:
> {quote}WARN RandomForest: Tree learning is using approximately 268492800 
> bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. 
> This allows splitting 2622 nodes in this iteration.{quote}
> This warning message is unnecessary and the data is not accurate.
> This is because
> {code:java}
> while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
>   }
>   numNodesInGroup += 1  *// we do not add the node to mutableNodesForGroup, 
> but we do add memUsage here.*
>   memUsage += nodeMemUsage
> }
> if (memUsage > maxMemoryUsage) {
>   // If maxMemoryUsage is 0, we should still allow splitting 1 node.
>   logWarning(s"Tree learning is using approximately $memUsage bytes per 
> iteration, which" +
> s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This 
> allows splitting" +
> s" $numNodesInGroup nodes in this iteration.")
> }
> {code}
> To avoid this unnecessary warning, we should change the code like this:
> {code:java}
> while (nodeStack.nonEmpty) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
> numNodesInGroup += 1 
> memUsage += nodeMemUsage
>   } else {
>     break
>   }
> }
> {code}






[jira] [Commented] (SPARK-21638) Warning message of RF is not accurate

2017-08-04 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114132#comment-16114132
 ] 

Peng Meng commented on SPARK-21638:
---

This is because "we not add the node to mutableNodesForGroup, but we add 
memUsage" in current implemantation.
This logic is not accurate.

> Warning message of RF is not accurate
> -
>
> Key: SPARK-21638
> URL: https://issues.apache.org/jira/browse/SPARK-21638
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: 
>Reporter: Peng Meng
>Priority: Minor
>
> When training an RF model, there are many warning messages like this:
> {quote}WARN RandomForest: Tree learning is using approximately 268492800 
> bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. 
> This allows splitting 2622 nodes in this iteration.{quote}
> This warning message is unnecessary and the data is not accurate.
> This is because
> {code:java}
> while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
>   }
>   numNodesInGroup += 1  *// we do not add the node to mutableNodesForGroup, 
> but we do add memUsage here.*
>   memUsage += nodeMemUsage
> }
> if (memUsage > maxMemoryUsage) {
>   // If maxMemoryUsage is 0, we should still allow splitting 1 node.
>   logWarning(s"Tree learning is using approximately $memUsage bytes per 
> iteration, which" +
> s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This 
> allows splitting" +
> s" $numNodesInGroup nodes in this iteration.")
> }
> {code}
> To avoid this unnecessary warning, we should change the code like this:
> {code:java}
> while (nodeStack.nonEmpty) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
> numNodesInGroup += 1 
> memUsage += nodeMemUsage
>   } else {
>     break
>   }
> }
> {code}






[jira] [Updated] (SPARK-21638) Warning message of RF is not accurate

2017-08-04 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-21638:
--
Description: 
When training an RF model, there are many warning messages like this:
{quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes 
per iteration, which exceeds requested limit maxMemoryUsage=268435456. This 
allows splitting 2622 nodes in this iteration.{quote}
This warning message is unnecessary and the data is not accurate.
This is because

{code:java}
while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.pop()
mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
mutable.ArrayBuffer[LearningNode]()) +=
  node
mutableTreeToNodeToIndexInfo
  .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
  = new NodeIndexInfo(numNodesInGroup, featureSubset)
  }
  numNodesInGroup += 1  *// we do not add the node to mutableNodesForGroup, but 
we do add memUsage here.*
  memUsage += nodeMemUsage
}
if (memUsage > maxMemoryUsage) {
  // If maxMemoryUsage is 0, we should still allow splitting 1 node.
  logWarning(s"Tree learning is using approximately $memUsage bytes per 
iteration, which" +
s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This allows 
splitting" +
s" $numNodesInGroup nodes in this iteration.")
}
{code}

To avoid this unnecessary warning, we should change the code like this:

{code:java}
while (nodeStack.nonEmpty) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.pop()
mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
mutable.ArrayBuffer[LearningNode]()) +=
  node
mutableTreeToNodeToIndexInfo
  .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
  = new NodeIndexInfo(numNodesInGroup, featureSubset)
numNodesInGroup += 1 
memUsage += nodeMemUsage
  } else {
    break
  }
}
{code}
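One nit on the proposal: Scala has no built-in break statement, so the early 
exit above would need scala.util.control.Breaks (or a boolean flag). A 
minimal, self-contained sketch of that exit pattern, with illustrative numbers 
of my own:

{code:java}
import scala.util.control.Breaks.{break, breakable}

// Toy version of the early exit: stop accumulating as soon as an item no
// longer fits the budget, instead of charging it and scanning further.
val costs = Seq(100L, 200L, 300L, 400L)
val maxBudget = 450L
var used = 0L
breakable {
  for (c <- costs) {
    if (used + c <= maxBudget) used += c
    else break // the first item that does not fit ends the scan
  }
}
// used == 300: the 300-cost item did not fit and was never charged
{code}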


  was:
When training an RF model, there are many warning messages like this:
{quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes 
per iteration, which exceeds requested limit maxMemoryUsage=268435456. This 
allows splitting 2622 nodes in this iteration.{quote}
This warning message is unnecessary and the data is not accurate.
This is because

{code:java}
while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.pop()
mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
mutable.ArrayBuffer[LearningNode]()) +=
  node
mutableTreeToNodeToIndexInfo
  .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
  = new NodeIndexInfo(numNodesInGroup, featureSubset)
  }
  numNodesInGroup += 1  *// we do not add the node to mutableNodesForGroup, but 
we do add memUsage here.*
  memUsage += nodeMemUsage
}
if (memUsage > maxMemoryUsage) {
  // If maxMemoryUsage is 0, we should still allow splitting 1 node.
  logWarning(s"Tree learning is using ap

[jira] [Updated] (SPARK-21638) Warning message of RF is not accurate

2017-08-04 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-21638:
--
Description: 
When training an RF model, there are many warning messages like this:
{quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes 
per iteration, which exceeds requested limit maxMemoryUsage=268435456. This 
allows splitting 2622 nodes in this iteration.{quote}
This warning message is unnecessary and the data is not accurate.
This is because

{code:java}
while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.pop()
mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
mutable.ArrayBuffer[LearningNode]()) +=
  node
mutableTreeToNodeToIndexInfo
  .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
  = new NodeIndexInfo(numNodesInGroup, featureSubset)
  }
  numNodesInGroup += 1  *// we do not add the node to mutableNodesForGroup, but 
we do add memUsage here.*
  memUsage += nodeMemUsage
}
if (memUsage > maxMemoryUsage) {
  // If maxMemoryUsage is 0, we should still allow splitting 1 node.
  logWarning(s"Tree learning is using approximately $memUsage bytes per 
iteration, which" +
s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This allows 
splitting" +
s" $numNodesInGroup nodes in this iteration.")
}
{code}

To avoid this unnecessary warning, we should change the code like this:

{code:java}
while (nodeStack.nonEmpty) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.pop()
mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
mutable.ArrayBuffer[LearningNode]()) +=
  node
mutableTreeToNodeToIndexInfo
  .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
  = new NodeIndexInfo(numNodesInGroup, featureSubset)
    numNodesInGroup += 1  // we do not add the node to mutableNodesForGroup, but we do add memUsage here.
memUsage += nodeMemUsage
  } else {
    break
  }
}
{code}


  was:
When training an RF model, there are many warning messages like this:
{quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes 
per iteration, which exceeds requested limit maxMemoryUsage=268435456. This 
allows splitting 2622 nodes in this iteration.{quote}
This warning message is unnecessary and the data is not accurate.
This is because

{code:java}
while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.pop()
mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
mutable.ArrayBuffer[LearningNode]()) +=
  node
mutableTreeToNodeToIndexInfo
  .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
  = new NodeIndexInfo(numNodesInGroup, featureSubset)
  }
  numNodesInGroup += 1  *// we do not add the node to mutableNodesForGroup, but 
we do add memUsage here.*
  memUsage += nodeMemUsage
}
if (memUsage > maxMemoryUsage) {
  // If maxMemoryUsage is 0, we should still allow splitting 1 node.
  logWarning(s"Tree learning is using approximately $memUsage bytes per iteration, which" +
    s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This allows splitting" +
    s" $numNodesInGroup nodes in this iteration.")
}
{code}

[jira] [Created] (SPARK-21638) Warning message of RF is not accurate

2017-08-04 Thread Peng Meng (JIRA)
Peng Meng created SPARK-21638:
-

 Summary: Warning message of RF is not accurate
 Key: SPARK-21638
 URL: https://issues.apache.org/jira/browse/SPARK-21638
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
 Environment: 


Reporter: Peng Meng
Priority: Minor


When training an RF model, there are many warning messages like this:
{quote}WARN RandomForest: Tree learning is using approximately 268492800 bytes 
per iteration, which exceeds requested limit maxMemoryUsage=268435456. This 
allows splitting 2622 nodes in this iteration.{quote}
This warning message is unnecessary and the data is not accurate.
This is because

{code:java}
while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.pop()
mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
mutable.ArrayBuffer[LearningNode]()) +=
  node
mutableTreeToNodeToIndexInfo
  .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
  = new NodeIndexInfo(numNodesInGroup, featureSubset)
  }
  numNodesInGroup += 1  *// we do not add the node to mutableNodesForGroup, but 
we do add memUsage here.*
  memUsage += nodeMemUsage
}
if (memUsage > maxMemoryUsage) {
  // If maxMemoryUsage is 0, we should still allow splitting 1 node.
  logWarning(s"Tree learning is using approximately $memUsage bytes per 
iteration, which" +
s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This allows 
splitting" +
s" $numNodesInGroup nodes in this iteration.")
}
{code}

To avoid this unnecessary warning, we should change the code like this:

{code:java}
while (nodeStack.nonEmpty) {
  val (treeIndex, node) = nodeStack.top
  // Choose subset of features for node (if subsampling).
  val featureSubset: Option[Array[Int]] = if (metadata.subsamplingFeatures) 
{
Some(SamplingUtils.reservoirSampleAndCount(Range(0,
  metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
rng.nextLong())._1)
  } else {
None
  }
  // Check if enough memory remains to add this node to the group.
  val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
featureSubset) * 8L
  if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
nodeStack.pop()
mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
mutable.ArrayBuffer[LearningNode]()) +=
  node
mutableTreeToNodeToIndexInfo
  .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
NodeIndexInfo]())(node.id)
  = new NodeIndexInfo(numNodesInGroup, featureSubset)
    numNodesInGroup += 1  // we do not add the node to mutableNodesForGroup, but we do add memUsage here.
memUsage += nodeMemUsage
  } else {
    break
  }
}
{code}







[jira] [Commented] (SPARK-21624) Optimize communication cost of RF/GBT/DT

2017-08-03 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113793#comment-16113793
 ] 

Peng Meng commented on SPARK-21624:
---

Thanks [~mlnick]; using Vector and compressing it is reasonable. I will submit 
a PR and show the performance data. Thanks.

> Optimize communication cost of RF/GBT/DT
> 
>
> Key: SPARK-21624
> URL: https://issues.apache.org/jira/browse/SPARK-21624
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> {quote}The implementation of RF is bound by either  the cost of statistics 
> computation on workers or by communicating the sufficient statistics.{quote}
> The statistics are stored in allStats:
> {code:java}
>   /**
>* Flat array of elements.
>* Index for start of stats for a (feature, bin) is:
>*   index = featureOffsets(featureIndex) + binIndex * statsSize
>*/
>   private var allStats: Array[Double] = new Array[Double](allStatsSize)
> {code}
> The size of allStats may be very large, and it can be very sparse, especially 
> for nodes near the leaves of the tree. 
> I have changed allStats from Array to SparseVector; my tests show the 
> communication cost is down by about 50%.






[jira] [Commented] (SPARK-21624) Optimize communication cost of RF/GBT/DT

2017-08-03 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112374#comment-16112374
 ] 

Peng Meng commented on SPARK-21624:
---

ping [~josephkb] [~srowen] [~yanboliang] [~mlnick] [~yuhaoyan]

> Optimize communication cost of RF/GBT/DT
> 
>
> Key: SPARK-21624
> URL: https://issues.apache.org/jira/browse/SPARK-21624
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> {quote}The implementation of RF is bound by either  the cost of statistics 
> computation on workers or by communicating the sufficient statistics.{quote}
> The statistics are stored in allStats:
> {code:java}
>   /**
>* Flat array of elements.
>* Index for start of stats for a (feature, bin) is:
>*   index = featureOffsets(featureIndex) + binIndex * statsSize
>*/
>   private var allStats: Array[Double] = new Array[Double](allStatsSize)
> {code}
> The size of allStats may be very large, and it can be very sparse, especially 
> for nodes near the leaves of the tree. 
> I have changed allStats from Array to SparseVector; my tests show the 
> communication cost is down by about 50%.






[jira] [Created] (SPARK-21624) Optimize communication cost of RF/GBT/DT

2017-08-03 Thread Peng Meng (JIRA)
Peng Meng created SPARK-21624:
-

 Summary: Optimize communication cost of RF/GBT/DT
 Key: SPARK-21624
 URL: https://issues.apache.org/jira/browse/SPARK-21624
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Peng Meng


{quote}The implementation of RF is bound by either  the cost of statistics 
computation on workers or by communicating the sufficient statistics.{quote}

The statistics are stored in allStats:

{code:java}
  /**
   * Flat array of elements.
   * Index for start of stats for a (feature, bin) is:
   *   index = featureOffsets(featureIndex) + binIndex * statsSize
   */
  private var allStats: Array[Double] = new Array[Double](allStatsSize)
{code}
The size of allStats may be very large, and it can be very sparse, especially 
for nodes near the leaves of the tree. 

I have changed allStats from Array to SparseVector; my tests show the 
communication cost is down by about 50%.







[jira] [Created] (SPARK-21623) Comments of parentStats on ml/tree/impl/DTStatsAggregator.scala is wrong

2017-08-03 Thread Peng Meng (JIRA)
Peng Meng created SPARK-21623:
-

 Summary: Comments of parentStats on 
ml/tree/impl/DTStatsAggregator.scala is wrong
 Key: SPARK-21623
 URL: https://issues.apache.org/jira/browse/SPARK-21623
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Peng Meng
Priority: Minor



{code:java}
   * Note: this is necessary because stats for the parent node are not available
   *   on the first iteration of tree learning.
   */
  private val parentStats: Array[Double] = new Array[Double](statsSize)
{code}

This comment is not right. Actually, parentStats is not only used for the 
first iteration; it is used in every iteration for unordered features.
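A corrected version of the comment might read (my suggested wording, not 
committed text):

{code:java}
   * Note: this is necessary because stats for the parent node are used not
   *   only on the first iteration of tree learning, but on every iteration
   *   when there are unordered features.
   */
  private val parentStats: Array[Double] = new Array[Double](statsSize)
{code}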







[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101476#comment-16101476
 ] 

Peng Meng commented on SPARK-21476:
---

Hi [~sagraw], could you please test copying the transform method from 
ProbabilisticClassifier into RandomForestClassificationModel and adding 
broadcasting inside transform? Thanks.
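
For reference, a minimal self-contained sketch of the pattern being suggested 
(my illustration, not Spark's code; MyEnsembleModel is a hypothetical stand-in 
for the fitted model): ship the model through a broadcast variable so tasks do 
not deserialize it from every closure.

{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for a fitted ensemble with a scoring routine.
case class MyEnsembleModel(weights: Array[Double]) extends Serializable {
  def predict(features: Vector): Double =
    features.toArray.zip(weights).map { case (x, w) => x * w }.sum
}

// Broadcast the (potentially large) model once per transform call; each task
// then fetches the broadcast instead of deserializing the model per closure.
def transformWithBroadcast(model: MyEnsembleModel, dataset: Dataset[_]): DataFrame = {
  val bcastModel = dataset.sparkSession.sparkContext.broadcast(model)
  val predictUDF = udf { features: Vector => bcastModel.value.predict(features) }
  dataset.withColumn("prediction", predictUDF(col("features")))
}
{code}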

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Comment Edited] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101476#comment-16101476
 ] 

Peng Meng edited comment on SPARK-21476 at 7/26/17 10:06 AM:
-

Hi [~sagraw], could you please test copying the transform method from 
ProbabilisticClassificationModel into RandomForestClassificationModel and 
adding broadcasting inside transform? Thanks.


was (Author: peng.m...@intel.com):
Hi [~sagraw], could you please test copying the transform method from 
ProbabilisticClassifier into RandomForestClassificationModel and adding 
broadcasting inside transform? Thanks.

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101253#comment-16101253
 ] 

Peng Meng commented on SPARK-21476:
---

Not every transform uses broadcast. Do you have some experimental data showing 
that broadcast is a problem here? Thanks. 

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Comment Edited] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-24 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099593#comment-16099593
 ] 

Peng Meng edited comment on SPARK-21476 at 7/25/17 6:55 AM:


Hi @Saurabh, I am profiling RF transform performance. I changed transform to 
use transformImpl, which uses broadcast, but found no performance improvement. 
Could you describe your case? For example, the tree size, number of features, 
dataset partitions, and number of executors. Thanks. 


was (Author: peng.m...@intel.com):
Hi @Saurabh, I am profiling RF transform performance. I changed transform to 
use transformImpl, which uses broadcast, but found no performance improvement. 
Could you describe your case? For example, the tree size, number of features, 
dataset partitions, and number of executors. Thanks. 

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-24 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099593#comment-16099593
 ] 

Peng Meng commented on SPARK-21476:
---

Hi @Saurabh, I am profiling RF transform performance. I changed transform to 
use transformImpl, which uses broadcast, but found no performance improvement. 
Could you describe your case? For example, the tree size, number of features, 
dataset partitions, and number of executors. Thanks. 

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-20 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094603#comment-16094603
 ] 

Peng Meng commented on SPARK-21476:
---

I am optimizing RF and GBT these days. If no one else is working on it, I can 
take it. Thanks.

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-19 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094266#comment-16094266
 ] 

Peng Meng commented on SPARK-21476:
---

It seems transform should use transformImpl, but it does not?

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.






[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2017-07-19 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094106#comment-16094106
 ] 

Peng Meng commented on SPARK-2465:
--

I think it is time to revisit this now. Some of our customers, such as JD.com, 
ask us to support Long IDs for ALS; they have more than Int.MaxValue products, 
so Long IDs for ALS are necessary for them. 
What do you think about reopening your PR? [~srowen]

> Use long as user / item ID for ALS
> --
>
> Key: SPARK-2465
> URL: https://issues.apache.org/jira/browse/SPARK-2465
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Priority: Minor
> Attachments: ALS using MEMORY_AND_DISK.png, ALS using 
> MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user 
> and product IDs in the ALS implementation.
> The main reason for is that identifiers are not generally numeric at all, and 
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
> means collisions are likely after hundreds of thousands of users and items, 
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int 
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per 
> Rating.
> Thoughts? I will post a PR so as to show what the change would be.
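
As a back-of-the-envelope check of the collision claim (my arithmetic, not 
part of the issue), the birthday bound 1 - exp(-n^2 / 2d) puts 32-bit hashing 
at roughly even odds of a collision around 77,000 ids, while 64-bit hashing 
stays negligible well into the billions:

{code:java}
// Birthday-bound estimate: the probability of at least one collision when
// hashing n ids uniformly into d = 2^bits buckets is about 1 - exp(-n^2 / 2d).
def collisionProb(n: Double, bits: Int): Double = {
  val d = math.pow(2.0, bits)
  1.0 - math.exp(-n * n / (2.0 * d))
}

println(collisionProb(77000, 32))   // ~0.50: even odds well under a million ids
println(collisionProb(500000, 32))  // ~1.0: collisions essentially certain
println(collisionProb(500000, 64))  // ~7e-9: negligible with 64-bit hashing
{code}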






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089656#comment-16089656
 ] 

Peng Meng commented on SPARK-21401:
---

Got it, thanks [~srowen]

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib are:
> get the values of a BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If we use the commonly used method pq.toArray.sorted() to get the sorted 
> values of pq, there is about a 10% performance reduction. 
> It is good to add a poll function for BoundedPriorityQueue, since many 
> usages of PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089648#comment-16089648
 ] 

Peng Meng commented on SPARK-21401:
---

Yes, you don't need to do it for the vast majority of elements.
The key problem here is boxing and unboxing. My testing shows that 
boxing/unboxing consumes most of the time; my micro-benchmark even shows 80% 
of the time is spent on boxing/unboxing.
If a redesigned BoundedPriorityQueue can avoid boxing/unboxing, it will be 
very useful. 

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib are:
> get the values of a BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If we use the commonly used method pq.toArray.sorted() to get the sorted 
> values of pq, there is about a 10% performance reduction. 
> It is good to add a poll function for BoundedPriorityQueue, since many 
> usages of PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089638#comment-16089638
 ] 

Peng Meng commented on SPARK-21401:
---

I mean we totally rewrite the BoundedPriorityQueue, not using Java PriorityQueue 
or Scala PriorityQueue, to create a high-performance one.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib are:
> get the values of a BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If we use the commonly used method pq.toArray.sorted() to get the sorted 
> values of pq, there is about a 10% performance reduction. 
> It is good to add a poll function for BoundedPriorityQueue, since many 
> usages of PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089629#comment-16089629
 ] 

Peng Meng commented on SPARK-21401:
---

Thanks [~srowen].
I mean for BoundedPriorityQueue, you can also offer directly; there is no need 
to work like this:
{quote}  
private def maybeReplaceLowest(a: A): Boolean =  {
val head = underlying.peek()
if (head != null && ord.gt(a, head))  {
  underlying.poll()
  underlying.offer(a)
} 
else  {
  false
}
}
{quote}
poll then offer: both poll and offer need to re-heap. 
Actually, this doesn't need to re-heap twice; just replacing the first element 
and re-heaping once is enough. (We should rewrite the code of 
BoundedPriorityQueue to do that.)
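
A minimal, self-contained sketch of that replace-the-root idea (my 
illustration, not Spark code), using a primitive Array[Double] so it also 
sidesteps the boxing discussed elsewhere in this thread:

{code:java}
// Bounded min-heap over primitive doubles that keeps the k largest values.
// Replacing the root and sifting down once does what poll()-then-offer()
// would do with two re-heaps, and with no java.lang.Double boxing.
final class BoundedDoubleHeap(k: Int) {
  private val heap = new Array[Double](k)
  private var size = 0

  def add(x: Double): Unit = {
    if (size < k) {              // heap not yet full: ordinary insert
      heap(size) = x; size += 1; siftUp(size - 1)
    } else if (x > heap(0)) {    // beats the current minimum: replace the root
      heap(0) = x; siftDown(0)   // single re-heap instead of poll + offer
    }
  }

  def toSortedArray: Array[Double] = heap.take(size).sorted

  private def siftUp(start: Int): Unit = {
    var i = start
    while (i > 0 && heap((i - 1) / 2) > heap(i)) {
      val p = (i - 1) / 2
      val t = heap(i); heap(i) = heap(p); heap(p) = t
      i = p
    }
  }

  private def siftDown(start: Int): Unit = {
    var i = start
    var done = false
    while (!done) {
      val l = 2 * i + 1
      val r = l + 1
      var s = i
      if (l < size && heap(l) < heap(s)) s = l
      if (r < size && heap(r) < heap(s)) s = r
      if (s == i) done = true
      else {
        val t = heap(i); heap(i) = heap(s); heap(s) = t
        i = s
      }
    }
  }
}
{code}

Used as {quote}val h = new BoundedDoubleHeap(20); values.foreach(h.add); 
h.toSortedArray{quote}, each offer beyond capacity costs one sift-down rather 
than two re-heaps.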

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib are:
> get the values of a BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If we use the commonly used method pq.toArray.sorted() to get the sorted 
> values of pq, there is about a 10% performance reduction. 
> It is good to add a poll function for BoundedPriorityQueue, since many 
> usages of PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089606#comment-16089606
 ] 

Peng Meng commented on SPARK-21401:
---

I think the BoundedPriorityQueue should be rewritten. There are two key 
problems:
1) To insert an element, you poll first and then offer, which re-heaps twice. 
Actually, we only need one re-heap if the code is rewritten.
2) BoundedPriorityQueue currently uses Java PriorityQueue, which incurs a lot 
of boxing/unboxing and is very time consuming. 
What do you think about it? 

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib are:
> get the values of a BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If we use the commonly used method pq.toArray.sorted() to get the sorted 
> values of pq, there is about a 10% performance reduction. 
> It is good to add a poll function for BoundedPriorityQueue, since many 
> usages of PQ need the sorted values.






[jira] [Comment Edited] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089586#comment-16089586
 ] 

Peng Meng edited comment on SPARK-21401 at 7/17/17 10:10 AM:
-

I have tested poll vs. toArray.sorted extensively. 
If the queue is mostly ordered (say, 2000 offers for a queue of size 20), 
pq.toArray.sorted is faster.
If the queue is mostly disordered (say, 100 offers for a queue of size 20), 
pq.poll is much faster.
So why not keep the poll function for BoundedPriorityQueue?  


was (Author: peng.m...@intel.com):
I have tested poll vs. toArray.sorted extensively. 
If the queue is mostly ordered, pq.toArray.sorted is faster.
If the queue is mostly disordered, pq.poll is much faster.
So why not keep the poll function for BoundedPriorityQueue?  

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib are:
> get the values of a BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If we use the commonly used method pq.toArray.sorted() to get the sorted 
> values of pq, there is about a 10% performance reduction. 
> It is good to add a poll function for BoundedPriorityQueue, since many 
> usages of PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089586#comment-16089586
 ] 

Peng Meng commented on SPARK-21401:
---

I have tested poll and toArray.sorted extensively. 
If the queue is mostly ordered, pq.toArray.sorted is faster.
If the queue is mostly disordered, pq.poll is much faster.
So why not keep the poll function for BoundedPriorityQueue?  

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
> get the values of the BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> and in ALS: pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the
> values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while (size > 0) {
>   size -= 1
>   val factor = pq.poll
> }{quote}
> Using the common approach, pq.toArray.sorted(), to get the sorted values of
> pq causes about a 10% performance reduction.
> It would be good to add a poll function to BoundedPriorityQueue, since many
> usages of the PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089578#comment-16089578
 ] 

Peng Meng commented on SPARK-21401:
---

Hi [~srowen], I found out why my original pq.toArray.sorted test was so slow.
My original test sorted with 
{quote}pq.toArray.sorted(Ordering.by[(Int, Double), Double](-_._2)){quote}
because {quote}pq.toArray.sorted(-_._2){quote} does not compile. That version 
probably incurs boxing/unboxing, so its performance is very bad.
Now I use {quote}pq.toArray.sortBy(-_._2){quote}, and its performance is 
slightly better than poll: 25s for this vs. 26s for poll.
Thanks.
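For reference, a sketch of the three variants discussed (note the Scala method 
is Ordering.by, lower case; the 25s/26s timings are from the ALS test above, 
not from this snippet):

{code:scala}
val arr = Array((1, 0.5), (2, 0.9), (3, 0.1))

// Does not compile: sorted takes an implicit Ordering, not a function.
// val bad = arr.sorted(-_._2)

// Compiles: an explicit Ordering built with Ordering.by (lower case).
val viaSorted = arr.sorted(Ordering.by[(Int, Double), Double](-_._2))

// Equivalent result and more idiomatic; sortBy(f) is defined as
// sorted(ord on f), so it builds the same kind of Ordering internally.
val viaSortBy = arr.sortBy(-_._2)
{code}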

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
> get the values of the BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> and in ALS: pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the
> values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while (size > 0) {
>   size -= 1
>   val factor = pq.poll
> }{quote}
> Using the common approach, pq.toArray.sorted(), to get the sorted values of
> pq causes about a 10% performance reduction.
> It would be good to add a poll function to BoundedPriorityQueue, since many
> usages of the PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089437#comment-16089437
 ] 

Peng Meng commented on SPARK-21401:
---

In my benchmark I only changed pq.toArray.sorted to pq.poll.
Each pq.poll is always O(log N).
What is the time complexity of quicksort when the data is already partially 
ordered? 
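One relevant data point for this question: Scala's sortBy on an array of boxed 
tuples delegates to java.util.Arrays.sort for objects, i.e. TimSort, which is 
adaptive to presorted runs. A minimal sketch to observe the effect (timings 
are illustrative only):

{code:scala}
import scala.util.Random

// TimSort detects presorted runs: fully ordered input sorts with ~n
// comparisons, shuffled input with O(n log n).
def time[A](label: String)(body: => A): A = {
  val t0 = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - t0) / 1e6} ms")
  result
}

val ordered  = Array.tabulate(1000000)(i => (i, i.toDouble))
val shuffled = Random.shuffle(ordered.toSeq).toArray

time("ordered input")(ordered.sortBy(_._2))
time("shuffled input")(shuffled.sortBy(_._2))
{code}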

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
> get the values of the BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> and in ALS: pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the
> values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while (size > 0) {
>   size -= 1
>   val factor = pq.poll
> }{quote}
> Using the common approach, pq.toArray.sorted(), to get the sorted values of
> pq causes about a 10% performance reduction.
> It would be good to add a poll function to BoundedPriorityQueue, since many
> usages of the PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089424#comment-16089424
 ] 

Peng Meng commented on SPARK-21401:
---

Hi [~srowen], for the ALS optimization, the difference between using poll and 
sorted is large:
with num = 20, the prediction time is about 31s with sorted and about 26s with 
poll. I think this difference is significant. I have tested many times, and 
the result is about the same.
https://github.com/apache/spark/pull/18624

I am now testing the effect of changing sorted to poll in LDA. 

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
> get the values of the BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> and in ALS: pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the
> values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while (size > 0) {
>   size -= 1
>   val factor = pq.poll
> }{quote}
> Using the common approach, pq.toArray.sorted(), to get the sorted values of
> pq causes about a 10% performance reduction.
> It would be good to add a poll function to BoundedPriorityQueue, since many
> usages of the PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-16 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089209#comment-16089209
 ] 

Peng Meng commented on SPARK-21401:
---

Hi [~srowen], here we also want to get a fully sorted list by taking the top 
element N (pq.size) times.
In my test, calling pq.poll N times is much faster than pq.toArray.sorted. 


> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
> get the values of the BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> and in ALS: pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the
> values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while (size > 0) {
>   size -= 1
>   val factor = pq.poll
> }{quote}
> Using the common approach, pq.toArray.sorted(), to get the sorted values of
> pq causes about a 10% performance reduction.
> It would be good to add a poll function to BoundedPriorityQueue, since many
> usages of the PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-13 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085874#comment-16085874
 ] 

Peng Meng commented on SPARK-21401:
---

Sure, I will add isEmpty, maybe some other functions, and test cases. 
Thanks.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
> get the values of the BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> and in ALS: pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the
> values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while (size > 0) {
>   size -= 1
>   val factor = pq.poll
> }{quote}
> Using the common approach, pq.toArray.sorted(), to get the sorted values of
> pq causes about a 10% performance reduction.
> It would be good to add a poll function to BoundedPriorityQueue, since many
> usages of the PQ need the sorted values.






[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-13 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085738#comment-16085738
 ] 

Peng Meng commented on SPARK-21401:
---

Yes, SPARK-21389 used pq.poll. pq.poll is just a small part of SPARK-21389, 
and pq.poll improves the performance of SPARK-21389 by about 10% compared 
with pq.toArray.sorted.
Other places that use pq.toArray.sorted may also be able to use pq.poll to 
improve performance, so I created this JIRA.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
> get the values of the BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> and in ALS: pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the
> values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while (size > 0) {
>   size -= 1
>   val factor = pq.poll
> }{quote}
> Using the common approach, pq.toArray.sorted(), to get the sorted values of
> pq causes about a 10% performance reduction.
> It would be good to add a poll function to BoundedPriorityQueue, since many
> usages of the PQ need the sorted values.






[jira] [Updated] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-13 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-21401:
--
Description: 
Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
get the values of the BoundedPriorityQueue, then sort them.
For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
and in ALS: pq.toArray.sorted()

The test results show that using pq.poll() is much faster than sorting the 
values.
For example, in PR https://github.com/apache/spark/pull/18624
we get the sorted values of pq with the following code:
{quote}
var size = pq.size
while (size > 0) {
  size -= 1
  val factor = pq.poll
}{quote} 

Using the common approach, pq.toArray.sorted(), to get the sorted values of 
pq causes about a 10% performance reduction. 

It would be good to add a poll function to BoundedPriorityQueue, since many 
usages of the PQ need the sorted values.


  was:
Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
get the values of the BoundedPriorityQueue, then sort them.
For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
and in ALS: pq.toArray.sorted()

The test results show that using pq.poll() is much faster than sorting the 
values.
For example, in PR https://github.com/apache/spark/pull/18624
we get the sorted values of pq with the following code:
{quote}
var size = pq.size
while (size > 0) {
  size -= 1
  val factor = pq.poll
}{quote} 

Using the common approach, pq.toArray.sorted(), to get the sorted values of 
pq causes about a 10% performance reduction. 

It would be good to add a poll function to BoundedPriorityQueue, since many 
usages of the PQ need the sorted values.



> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
> get the values of the BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> and in ALS: pq.toArray.sorted()
> The test results show that using pq.poll() is much faster than sorting the
> values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while (size > 0) {
>   size -= 1
>   val factor = pq.poll
> }{quote}
> Using the common approach, pq.toArray.sorted(), to get the sorted values of
> pq causes about a 10% performance reduction.
> It would be good to add a poll function to BoundedPriorityQueue, since many
> usages of the PQ need the sorted values.






[jira] [Created] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-13 Thread Peng Meng (JIRA)
Peng Meng created SPARK-21401:
-

 Summary: add poll function for BoundedPriorityQueue
 Key: SPARK-21401
 URL: https://issues.apache.org/jira/browse/SPARK-21401
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Peng Meng
Priority: Minor


Most of the BoundedPriorityQueue usages in ML/MLlib follow this pattern:
get the values of the BoundedPriorityQueue, then sort them.
For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
and in ALS: pq.toArray.sorted()

The test results show that using pq.poll() is much faster than sorting the 
values.
For example, in PR https://github.com/apache/spark/pull/18624
we get the sorted values of pq with the following code:
{quote}
var size = pq.size
while (size > 0) {
  size -= 1
  val factor = pq.poll
}{quote} 

Using the common approach, pq.toArray.sorted(), to get the sorted values of 
pq causes about a 10% performance reduction. 

It would be good to add a poll function to BoundedPriorityQueue, since many 
usages of the PQ need the sorted values.







[jira] [Updated] (SPARK-21389) ALS recommendForAll optimization uses Native BLAS

2017-07-13 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-21389:
--
Description: 
In Spark 2.2, we optimized ALS recommendForAll, which uses a hand-written 
matrix multiplication and gets the topK items for each matrix block. The 
method effectively reduces the GC problem. However, with native BLAS GEMM 
(like Intel MKL and OpenBLAS), matrix multiplication is about 10X faster 
than the hand-written method. 

I have rewritten the recommendForAll code with GEMM and got about 50% 
improvement compared with the master recommendForAll method. 

The key points of this optimization:
1) Use GEMM to replace the hand-written matrix multiplication.
2) Use a matrix to keep temp results, which largely reduces GC and computing 
time. The master method creates many small objects, which is why using GEMM 
directly cannot get good performance.
3) Use sort and merge to get the topK items, so the priority queue does not 
need to be called twice.

Test result:
479818 users, 13727 products, rank = 10, topK = 20.
3 workers, each with 35 cores. Native BLAS is Intel MKL.

Block size:      1000    2000    4000    8000
Master method:    40s    39.4s   39.5s   39.1s
This method:     26.5s   25.9s   26s     27.1s

Performance improvement: (OldTime - NewTime) / NewTime = about 50%
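As a rough illustration of key point 1), a hedged sketch against the 
netlib-java API that mllib depended on at the time; the block layout and names 
here are assumptions, not the PR's actual code:

{code:scala}
import com.github.fommil.netlib.BLAS

// Sketch: score one block of users against one block of items with a
// single GEMM call instead of hand-written loops. Arrays are assumed
// column-major: userBlock is m x rank, itemBlock is rank x n, and
// scores receives the m x n result.
val blas = BLAS.getInstance()

def scoreBlock(m: Int, n: Int, rank: Int,
               userBlock: Array[Double],
               itemBlock: Array[Double],
               scores: Array[Double]): Unit = {
  blas.dgemm("N", "N", m, n, rank,
    1.0, userBlock, m,
    itemBlock, rank,
    0.0, scores, m)
}
{code}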

  was:
In Spark 2.2, we optimized ALS recommendForAll, which uses a hand-written 
matrix multiplication and gets the topK items for each matrix block. The 
method effectively reduces the GC problem. However, with native BLAS GEMM 
(like Intel MKL and OpenBLAS), matrix multiplication is about 10X faster 
than the hand-written method. 

I have rewritten the recommendForAll code with GEMM and got about 50% 
improvement compared with the master recommendForAll method. 

The key points of this optimization:
1) Use GEMM to replace the hand-written matrix multiplication.
2) Use a matrix to keep temp results, which largely reduces GC and computing 
time. The master method creates many small objects, which is why using GEMM 
directly cannot get good performance.
3) Use sort and merge to get the topK items, so the priority queue does not 
need to be called twice.

Test result:
479818 users, 13727 products, rank = 10, topK = 20.
3 workers, each with 35 cores. Native BLAS is Intel MKL.

Block size:      1000    2000    4000    8000
Master method:    40s    39.4s   39.5s   39.1s
This method:     26.5s   25.9s   26s     27.1s

Performance improvement: (OldTime - NewTime) / NewTime = about 50%


> ALS recommendForAll optimization uses Native BLAS
> -
>
> Key: SPARK-21389
> URL: https://issues.apache.org/jira/browse/SPARK-21389
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark 2.2, we optimized ALS recommendForAll, which uses a hand-written
> matrix multiplication and gets the topK items for each matrix block. The
> method effectively reduces the GC problem. However, with native BLAS GEMM
> (like Intel MKL and OpenBLAS), matrix multiplication is about 10X faster
> than the hand-written method.
> I have rewritten the recommendForAll code with GEMM and got about 50%
> improvement compared with the master recommendForAll method.
> The key points of this optimization:
> 1) Use GEMM to replace the hand-written matrix multiplication.
> 2) Use a matrix to keep temp results, which largely reduces GC and computing
> time. The master method creates many small objects, which is why using GEMM
> directly cannot get good performance.
> 3) Use sort and merge to get the topK items, so the priority queue does not
> need to be called twice.
> Test result:
> 479818 users, 13727 products, rank = 10, topK = 20.
> 3 workers, each with 35 cores. Native BLAS is Intel MKL.
> Block size:      1000    2000    4000    8000
> Master method:    40s    39.4s   39.5s   39.1s
> This method:     26.5s   25.9s   26s     27.1s
> Performance improvement: (OldTime - NewTime) / NewTime = about 50%






[jira] [Updated] (SPARK-21389) ALS recommendForAll optimization uses Native BLAS

2017-07-13 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-21389:
--
Description: 
In Spark 2.2, we optimized ALS recommendForAll, which uses a hand-written 
matrix multiplication and gets the topK items for each matrix block. The 
method effectively reduces the GC problem. However, with native BLAS GEMM 
(like Intel MKL and OpenBLAS), matrix multiplication is about 10X faster 
than the hand-written method. 

I have rewritten the recommendForAll code with GEMM and got about 50% 
improvement compared with the master recommendForAll method. 

The key points of this optimization:
1) Use GEMM to replace the hand-written matrix multiplication.
2) Use a matrix to keep temp results, which largely reduces GC and computing 
time. The master method creates many small objects, which is why using GEMM 
directly cannot get good performance.
3) Use sort and merge to get the topK items, so the priority queue does not 
need to be called twice.

Test result:
479818 users, 13727 products, rank = 10, topK = 20.
3 workers, each with 35 cores. Native BLAS is Intel MKL.

Block size:      1000    2000    4000    8000
Master method:    40s    39.4s   39.5s   39.1s
This method:     26.5s   25.9s   26s     27.1s

Performance improvement: (OldTime - NewTime) / NewTime = about 50%

  was:
In Spark 2.2, we optimized ALS recommendForAll, which uses a hand-written 
matrix multiplication and gets the topK items for each matrix block. The 
method effectively reduces the GC problem. However, with native BLAS GEMM 
(like Intel MKL and OpenBLAS), matrix multiplication is about 10X faster 
than the hand-written method. 

I have rewritten the recommendForAll code with GEMM and got about 20%-30% 
improvement compared with the master recommendForAll method. 

Will clean up the code and submit it for discussion.


> ALS recommendForAll optimization uses Native BLAS
> -
>
> Key: SPARK-21389
> URL: https://issues.apache.org/jira/browse/SPARK-21389
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In Spark 2.2, we optimized ALS recommendForAll, which uses a hand-written
> matrix multiplication and gets the topK items for each matrix block. The
> method effectively reduces the GC problem. However, with native BLAS GEMM
> (like Intel MKL and OpenBLAS), matrix multiplication is about 10X faster
> than the hand-written method.
> I have rewritten the recommendForAll code with GEMM and got about 50%
> improvement compared with the master recommendForAll method.
> The key points of this optimization:
> 1) Use GEMM to replace the hand-written matrix multiplication.
> 2) Use a matrix to keep temp results, which largely reduces GC and computing
> time. The master method creates many small objects, which is why using GEMM
> directly cannot get good performance.
> 3) Use sort and merge to get the topK items, so the priority queue does not
> need to be called twice.
> Test result:
> 479818 users, 13727 products, rank = 10, topK = 20.
> 3 workers, each with 35 cores. Native BLAS is Intel MKL.
> Block size:      1000    2000    4000    8000
> Master method:    40s    39.4s   39.5s   39.1s
> This method:     26.5s   25.9s   26s     27.1s
> Performance improvement: (OldTime - NewTime) / NewTime = about 50%






[jira] [Created] (SPARK-21389) ALS recommendForAll optimization uses Native BLAS

2017-07-12 Thread Peng Meng (JIRA)
Peng Meng created SPARK-21389:
-

 Summary: ALS recommendForAll optimization uses Native BLAS
 Key: SPARK-21389
 URL: https://issues.apache.org/jira/browse/SPARK-21389
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Peng Meng


In Spark 2.2, we optimized ALS recommendForAll, which uses a hand-written 
matrix multiplication and gets the topK items for each matrix block. The 
method effectively reduces the GC problem. However, with native BLAS GEMM 
(like Intel MKL and OpenBLAS), matrix multiplication is about 10X faster 
than the hand-written method. 

I have rewritten the recommendForAll code with GEMM and got about 20%-30% 
improvement compared with the master recommendForAll method. 

Will clean up the code and submit it for discussion.






[jira] [Comment Edited] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance

2017-07-06 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076152#comment-16076152
 ] 

Peng Meng edited comment on SPARK-21305 at 7/6/17 8:22 AM:
---

I tested Intel MKL and OpenBLAS with ALS train and prediction.
ALS train uses BLAS SPR; ALS prediction uses BLAS GEMM.

For ALS prediction (seconds):
      OpenBLAS   Intel MKL
MT:      124        87
ST:       37        34

For ALS train (seconds):
      OpenBLAS   Intel MKL
MT:       70        58
ST:       63        58

Some native BLAS subprograms are not multi-threaded, so whether MT is disabled 
or not, their performance will not change. 
Therefore, disabling multi-threading will not improve the performance of a 
Spark job that only uses single-threaded native BLAS subprograms. 
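For reproducibility of tables like this, a small sketch for checking which 
BLAS backend netlib-java actually loaded, with the single-thread settings from 
the links in the description noted as comments:

{code:scala}
import com.github.fommil.netlib.BLAS

// Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when a native
// library was picked up, or com.github.fommil.netlib.F2jBLAS for the
// pure-JVM fallback, so you know which backend a benchmark measured.
println(BLAS.getInstance().getClass.getName)

// Per the OpenBLAS and MKL links in the description, pin native BLAS to
// one thread by exporting (e.g. in spark-env.sh, before executors start):
//   export OPENBLAS_NUM_THREADS=1
//   export MKL_NUM_THREADS=1
{code}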



was (Author: peng.m...@intel.com):
I tested Intel MKL and OpenBLAS with ALS train and prediction.
ALS train uses BLAS SPR; ALS prediction uses BLAS GEMM.

For ALS prediction (seconds):
      OpenBLAS   Intel MKL
MT:      124        87
ST:       37        34

For ALS train (seconds):
MT:       70        58
ST:       63        58

Some native BLAS subprograms are not multi-threaded, so whether MT is disabled 
or not, their performance will not change. 
Therefore, disabling multi-threading will not improve the performance of a 
Spark job that only uses single-threaded native BLAS subprograms. 


> The BKM (best known methods) of using native BLAS to improvement ML/MLLIB 
> performance
> -
>
> Key: SPARK-21305
> URL: https://issues.apache.org/jira/browse/SPARK-21305
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Critical
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS)
> to improve performance.
> How native BLAS is used is important for performance; sometimes (quite
> often) native BLAS even causes worse performance.
> Take, for example, the ALS recommendForAll method before Spark 2.2, which
> used BLAS gemm for matrix multiplication.
> If you only test the matrix multiplication performance of native BLAS gemm
> (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, native
> BLAS gives about a 10X improvement. But if you test Spark job end-to-end
> performance, F2j is much faster than native BLAS, which is very interesting.
> I spent much time on this problem and found that we should not use native
> BLAS (like OpenBLAS and Intel MKL) that supports multi-threading without
> any settings. By default, such native BLAS enables multi-threading, which
> conflicts with the Spark executor. You can use multi-threaded native BLAS,
> but it is better to disable multi-threading first.
> https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded
> https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
> I think we should add some comments in docs/ml-guide.md about this first.






[jira] [Commented] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance

2017-07-06 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076152#comment-16076152
 ] 

Peng Meng commented on SPARK-21305:
---

I tested Intel MKL and OpenBLAS with ALS train and prediction.
ALS train uses BLAS SPR; ALS prediction uses BLAS GEMM.

For ALS prediction (seconds):
      OpenBLAS   Intel MKL
MT:      124        87
ST:       37        34

For ALS train (seconds):
MT:       70        58
ST:       63        58

Some native BLAS subprograms are not multi-threaded, so whether MT is disabled 
or not, their performance will not change. 
Therefore, disabling multi-threading will not improve the performance of a 
Spark job that only uses single-threaded native BLAS subprograms. 


> The BKM (best known methods) of using native BLAS to improvement ML/MLLIB 
> performance
> -
>
> Key: SPARK-21305
> URL: https://issues.apache.org/jira/browse/SPARK-21305
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Critical
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS)
> to improve performance.
> How native BLAS is used is important for performance; sometimes (quite
> often) native BLAS even causes worse performance.
> Take, for example, the ALS recommendForAll method before Spark 2.2, which
> used BLAS gemm for matrix multiplication.
> If you only test the matrix multiplication performance of native BLAS gemm
> (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, native
> BLAS gives about a 10X improvement. But if you test Spark job end-to-end
> performance, F2j is much faster than native BLAS, which is very interesting.
> I spent much time on this problem and found that we should not use native
> BLAS (like OpenBLAS and Intel MKL) that supports multi-threading without
> any settings. By default, such native BLAS enables multi-threading, which
> conflicts with the Spark executor. You can use multi-threaded native BLAS,
> but it is better to disable multi-threading first.
> https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded
> https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
> I think we should add some comments in docs/ml-guide.md about this first.






[jira] [Commented] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance

2017-07-04 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074320#comment-16074320
 ] 

Peng Meng commented on SPARK-21305:
---

Thanks [~srowen] and [~yanboliang].

I will disable native BLAS multi-threading in spark-env.sh and upload the 
performance data. The PR will be submitted soon. 

> The BKM (best known methods) of using native BLAS to improvement ML/MLLIB 
> performance
> -
>
> Key: SPARK-21305
> URL: https://issues.apache.org/jira/browse/SPARK-21305
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Critical
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS)
> to improve performance.
> How native BLAS is used is important for performance; sometimes (quite
> often) native BLAS even causes worse performance.
> Take, for example, the ALS recommendForAll method before Spark 2.2, which
> used BLAS gemm for matrix multiplication.
> If you only test the matrix multiplication performance of native BLAS gemm
> (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, native
> BLAS gives about a 10X improvement. But if you test Spark job end-to-end
> performance, F2j is much faster than native BLAS, which is very interesting.
> I spent much time on this problem and found that we should not use native
> BLAS (like OpenBLAS and Intel MKL) that supports multi-threading without
> any settings. By default, such native BLAS enables multi-threading, which
> conflicts with the Spark executor. You can use multi-threaded native BLAS,
> but it is better to disable multi-threading first.
> https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded
> https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
> I think we should add some comments in docs/ml-guide.md about this first.






[jira] [Comment Edited] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance

2017-07-04 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073801#comment-16073801
 ] 

Peng Meng edited comment on SPARK-21305 at 7/4/17 3:39 PM:
---

Yes, I will do that. 
Because the method to disable multi-threading differs across BLAS 
implementations, I have not found a good way to disable it in the Spark code. 



was (Author: peng.m...@intel.com):
Yes, I will do that. 
Because the method to disable multi-threading differs across BLAS 
implementations, I have not found a good way to disable it in the Spark code. 


> The BKM (best known methods) of using native BLAS to improvement ML/MLLIB 
> performance
> -
>
> Key: SPARK-21305
> URL: https://issues.apache.org/jira/browse/SPARK-21305
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Critical
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS)
> to improve performance.
> How native BLAS is used is important for performance; sometimes (quite
> often) native BLAS even causes worse performance.
> Take, for example, the ALS recommendForAll method before Spark 2.2, which
> used BLAS gemm for matrix multiplication.
> If you only test the matrix multiplication performance of native BLAS gemm
> (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, native
> BLAS gives about a 10X improvement. But if you test Spark job end-to-end
> performance, F2j is much faster than native BLAS, which is very interesting.
> I spent much time on this problem and found that we should not use native
> BLAS (like OpenBLAS and Intel MKL) that supports multi-threading without
> any settings. By default, such native BLAS enables multi-threading, which
> conflicts with the Spark executor. You can use multi-threaded native BLAS,
> but it is better to disable multi-threading first.
> https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded
> https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
> I think we should add some comments in docs/ml-guide.md about this first.






[jira] [Commented] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance

2017-07-04 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073801#comment-16073801
 ] 

Peng Meng commented on SPARK-21305:
---

Yes, I will do that. 
Because the method to disable multi-threading differs across BLAS 
implementations, I have not found a good way to disable it in the Spark code. 


> The BKM (best known methods) of using native BLAS to improvement ML/MLLIB 
> performance
> -
>
> Key: SPARK-21305
> URL: https://issues.apache.org/jira/browse/SPARK-21305
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Critical
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS)
> to improve performance.
> How native BLAS is used is important for performance; sometimes (quite
> often) native BLAS even causes worse performance.
> Take, for example, the ALS recommendForAll method before Spark 2.2, which
> used BLAS gemm for matrix multiplication.
> If you only test the matrix multiplication performance of native BLAS gemm
> (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, native
> BLAS gives about a 10X improvement. But if you test Spark job end-to-end
> performance, F2j is much faster than native BLAS, which is very interesting.
> I spent much time on this problem and found that we should not use native
> BLAS (like OpenBLAS and Intel MKL) that supports multi-threading without
> any settings. By default, such native BLAS enables multi-threading, which
> conflicts with the Spark executor. You can use multi-threaded native BLAS,
> but it is better to disable multi-threading first.
> https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded
> https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
> I think we should add some comments in docs/ml-guide.md about this first.






[jira] [Commented] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance

2017-07-04 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073798#comment-16073798
 ] 

Peng Meng commented on SPARK-21305:
---

ping [~mlnick], [~yanboliang], [~mengxr], [~srowen]

> The BKM (best known methods) of using native BLAS to improvement ML/MLLIB 
> performance
> -
>
> Key: SPARK-21305
> URL: https://issues.apache.org/jira/browse/SPARK-21305
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Critical
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS)
> to improve performance.
> How native BLAS is used is important for performance; sometimes (quite
> often) native BLAS even causes worse performance.
> Take, for example, the ALS recommendForAll method before Spark 2.2, which
> used BLAS gemm for matrix multiplication.
> If you only test the matrix multiplication performance of native BLAS gemm
> (like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, native
> BLAS gives about a 10X improvement. But if you test Spark job end-to-end
> performance, F2j is much faster than native BLAS, which is very interesting.
> I spent much time on this problem and found that we should not use native
> BLAS (like OpenBLAS and Intel MKL) that supports multi-threading without
> any settings. By default, such native BLAS enables multi-threading, which
> conflicts with the Spark executor. You can use multi-threaded native BLAS,
> but it is better to disable multi-threading first.
> https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded
> https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
> I think we should add some comments in docs/ml-guide.md about this first.






[jira] [Created] (SPARK-21305) The BKM (best known methods) of using native BLAS to improvement ML/MLLIB performance

2017-07-04 Thread Peng Meng (JIRA)
Peng Meng created SPARK-21305:
-

 Summary: The BKM (best known methods) of using native BLAS to 
improvement ML/MLLIB performance
 Key: SPARK-21305
 URL: https://issues.apache.org/jira/browse/SPARK-21305
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Peng Meng
Priority: Critical


Many ML/MLLIB algorithms use native BLAS (like Intel MKL, ATLAS, OpenBLAS) to 
improve performance.
How native BLAS is used is important for performance; sometimes (quite often) 
native BLAS even causes worse performance.
Take, for example, the ALS recommendForAll method before Spark 2.2, which used 
BLAS gemm for matrix multiplication.
If you only test the matrix multiplication performance of native BLAS gemm 
(like Intel MKL and OpenBLAS) against netlib-java F2j BLAS gemm, native BLAS 
gives about a 10X improvement. But if you test Spark job end-to-end 
performance, F2j is much faster than native BLAS, which is very interesting. 

I spent much time on this problem and found that we should not use native BLAS 
(like OpenBLAS and Intel MKL) that supports multi-threading without any 
settings. By default, such native BLAS enables multi-threading, which 
conflicts with the Spark executor. You can use multi-threaded native BLAS, but 
it is better to disable multi-threading first. 

https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded
https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications

I think we should add some comments in docs/ml-guide.md about this first. 






[jira] [Updated] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-05-23 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-20443:
--
Attachment: blockSize.jpg

blockSize and the performance of ALS recommendForAll

> The blockSize of MLLIB ALS should be setting  by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
> Attachments: blockSize.jpg
>
>
> The blockSize of MLLIB ALS is very important for ALS performance.
> In our test, with blockSize 128 the performance is about 4X better than with
> blockSize 4096 (the default value).
> The following are our test results, as blockSize (recommendForAll time):
> 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s),
> 8192 (OOM)
> The test environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The data: 480,000 users and 17,000 items.






[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version

2017-05-22 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020662#comment-16020662
 ] 

Peng Meng commented on SPARK-20764:
---

I will submit a PR to cover more tests for model summary, thanks.

> Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and 
> GLR - Python version
> 
>
> Key: SPARK-20764
> URL: https://issues.apache.org/jira/browse/SPARK-20764
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Assignee: Peng Meng
>Priority: Minor
> Fix For: 2.2.0
>
>
> SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and 
> {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should 
> be updated to reflect these changes.






[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version

2017-05-22 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020555#comment-16020555
 ] 

Peng Meng commented on SPARK-20764:
---

Thanks [~yanboliang], but isn't that check too short?
The PySpark LinearRegressionSummary test is only two lines of code and checks 
just one function.
If there were a typo, say I wrote [return 
self._call_java("degreeOfFreedom")] (degreeOfFreedom in place of 
degreesOfFreedom), no error would be raised and all tests would pass, but that 
certainly wouldn't work.

> Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and 
> GLR - Python version
> 
>
> Key: SPARK-20764
> URL: https://issues.apache.org/jira/browse/SPARK-20764
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Assignee: Peng Meng
>Priority: Minor
> Fix For: 2.2.0
>
>
> SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and 
> {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should 
> be updated to reflect these changes.






[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version

2017-05-22 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019520#comment-16019520
 ] 

Peng Meng commented on SPARK-20764:
---

Hi [~mlnick], why is there no test code for some classes in PySpark, like 
LinearRegressionSummary and GeneralizedLinearRegressionSummary?


> Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and 
> GLR - Python version
> 
>
> Key: SPARK-20764
> URL: https://issues.apache.org/jira/browse/SPARK-20764
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and 
> {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should 
> be updated to reflect these changes.






[jira] [Commented] (SPARK-20764) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version

2017-05-21 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019096#comment-16019096
 ] 

Peng Meng commented on SPARK-20764:
---

Hi [~mlnick], are you working on this? If not, I can work on it.

> Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and 
> GLR - Python version
> 
>
> Key: SPARK-20764
> URL: https://issues.apache.org/jira/browse/SPARK-20764
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-20097 exposed {{degreesOfFreedom}} in {{LinearRegressionSummary}} and 
> {{numInstances}} in {{GeneralizedLinearRegressionSummary}}. Python API should 
> be updated to reflect these changes.






[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983074#comment-15983074
 ] 

Peng Meng commented on SPARK-11968:
---

Thanks [~mlnick], I will post more results here. 
My latest result: I changed the PriorityQueue to a BoundedPriorityQueue, and 
there is about 30% improvement. I will update the PR and the results. 

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]






[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-04-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983059#comment-15983059
 ] 

Peng Meng commented on SPARK-20443:
---

Yes, based on my current tests, I agree.
But if the data size is large, there may be a benefit to adjusting the block 
size. 

> The blockSize of MLLIB ALS should be setting  by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLLIB ALS is very important for ALS performance.
> In our test, with blockSize 128 the performance is about 4X better than with
> blockSize 4096 (the default value).
> The following are our test results, as blockSize (recommendForAll time):
> 128 (124s), 256 (160s), 512 (184s), 1024 (244s), 2048 (332s), 4096 (488s),
> 8192 (OOM)
> The test environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The data: 480,000 users and 17,000 items.






[jira] [Comment Edited] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982251#comment-15982251
 ] 

Peng Meng edited comment on SPARK-20446 at 4/25/17 3:06 PM:


Yes, I compared with ML ALSModel.recommendAll. The data size is 
480,000*17,000; compared with the mllib recommendAll, there is about 10% to 
20% improvement. 

Blockify is very important for the performance of our solution, because 
CartesianRDD computes:
for (x <- rdd1.iterator(currSplit.s1, context); y <- 
rdd2.iterator(currSplit.s2, context)) yield (x, y)
Without blockify, rdd2.iterator is called once per record of rdd1.
With blockify, rdd2.iterator is called once per group of rdd1.
The performance difference is very large (a sketch follows below).

The SparkSQL approach is much different from my solution. It uses crossJoin to 
do the Cartesian product; the key optimization is in UnsafeCartesianRDD, which 
uses rowArray to cache the right partition. 
 
The key parts of our method:
1. Use blockify to optimize the Cartesian product computation (not to enable 
BLAS level 3).
2. Use a priority queue to save memory (one per group), not to find the topK 
products for each user.
The SparkSQL approach doesn't work like this.   
cc [~mengxr]
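A minimal sketch of the blockify idea (hypothetical helper; blockSize is a 
tuning parameter, and userFeatures/itemFeatures stand in for the factor RDDs):

{code:scala}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Chunk each partition into arrays so that cartesian() traverses the
// right-hand RDD once per group of blockSize records instead of once
// per record.
def blockify[T: ClassTag](rdd: RDD[T], blockSize: Int): RDD[Array[T]] =
  rdd.mapPartitions(_.grouped(blockSize).map(_.toArray))

// val pairs = blockify(userFeatures, 4096).cartesian(blockify(itemFeatures, 4096))
{code}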




was (Author: peng.m...@intel.com):
Yes, I compared with ML ALSModel.recommendAll. The data size is 
480,000*17,000; compared with the mllib recommendAll, there is about 10% to 
20% improvement. 

Blockify is very important for the performance of our solution, because 
CartesianRDD computes:
for (x <- rdd1.iterator(currSplit.s1, context); y <- 
rdd2.iterator(currSplit.s2, context)) yield (x, y)
Without blockify, rdd2.iterator is called once per record of rdd1.
With blockify, rdd2.iterator is called once per group of rdd1.
The performance difference is very large.

The SparkSQL approach is much different from my solution. It uses crossJoin to 
do the Cartesian product; the key optimization is in UnsafeCartesianRDD, which 
uses rowArray to cache the right partition. 
 
The key parts of our method:
1. Use blockify to optimize the Cartesian product computation (not to enable 
BLAS level 3).
2. Use a priority queue to save memory (one per group), not to find the topK 
products for each user.
The SparkSQL approach doesn't work like this.   
cc [~mengxr]



> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method.
> The task uses the following code to keep temp results:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no method to set it)
> so output is about 4k * 4k * (4 + 4 + 8) bytes = 256M. This is a lot of
> memory, causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend
> topK (about 10 or 20) products for each user; we only need 4k * topK *
> (4 + 4 + 8) bytes to save the temp results.
> I have written a solution for this method with the following test results.
> The test environment:
> 3 workers, each with 10 cores, 30G memory, and 1 executor.
> The data: 480,000 users and 17,000 items.
> BlockSize:      1024   2048   4096   8192
> Old method:     245s   332s   488s   OOM
> This solution:  121s   118s   117s   120s
>  






[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983030#comment-15983030
 ] 

Peng Meng commented on SPARK-20446:
---

Thanks [~mlnick], I agree with you. I am OK with closing this ticket and 
moving the discussion to SPARK-11968.

> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no method to set it)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp result.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize:     1024  2048  4096  8192
> Old method:    245s  332s  488s  OOM
> This solution: 121s  118s  117s  120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982251#comment-15982251
 ] 

Peng Meng commented on SPARK-20446:
---

Yes, I compared with ML ALSModel.recommendAll. The data size is 480,000*17,000; 
comparing with mllib recommendAll, there is about a 10% to 20% improvement. 

Blockify is very important to the performance of our solution, because the compute 
method of CartesianRDD is:
for (x <- rdd1.iterator(currSplit.s1, context); y <- 
rdd2.iterator(currSplit.s2, context)) yield (x, y)
Without blockify, rdd2.iterator will be called once per record of rdd1.
With blockify, rdd2.iterator will be called only once per group of rdd1.
The performance difference is very large.

The SparkSQL approach is quite different from my solution. It uses crossJoin to do 
the Cartesian product; its key optimization is in UnsafeCartesianRDD, which uses 
rowArray to cache the right partition. 
 
The key parts of our method:
1. use blockify to optimize the Cartesian product computation, not to enable 
BLAS 3.
2. use a priorityQueue (one per group) to save memory, not just to find the topK 
products for each user.
The SparkSQL approach doesn't work like this.
cc [~mengxr]



> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no method to set it)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp result.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize:     1024  2048  4096  8192
> Old method:    245s  332s  488s  OOM
> This solution: 121s  118s  117s  120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981148#comment-15981148
 ] 

Peng Meng edited comment on SPARK-20446 at 4/24/17 4:18 PM:


Thanks [~mlnick], I also compared the DataFrame version of ALS recommendForAll, but 
found no big performance improvement. 
In our solution: 
1. We use BLAS 2 to replace BLAS 3, which removes much of the GC caused by the BLAS 
computation.
2. We use output = Array[(Int, (Int, Double))](4096*topK) to replace output = 
Array[(Int, (Int, Double))](4096*4096), which greatly reduces the memory 
allocation. 
3. We use a priorityQueue for the sort, which improves it by about 40% compared with 
a general sort (see the sketch after this list).
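
As an illustration only (hypothetical names and shapes, not the actual patch), the per-user top-K accumulation with a bounded min-heap looks like this; memory per user is O(topK) instead of O(numItems):

{code:scala}
import scala.collection.mutable

// Sketch: scores(i)(j) is the predicted rating of item itemIds(j) for user
// userIds(i). Keep only the topK items per user in a size-bounded min-heap.
def topKPerUser(userIds: Array[Int], itemIds: Array[Int],
                scores: Array[Array[Double]],
                topK: Int): Array[(Int, Array[(Int, Double)])] =
  userIds.indices.map { i =>
    // Min-heap by score: the weakest of the current top-K sits at the head.
    val heap = mutable.PriorityQueue.empty[(Int, Double)](Ordering.by[(Int, Double), Double](-_._2))
    var j = 0
    while (j < itemIds.length) {
      val s = scores(i)(j)
      if (heap.size < topK) heap.enqueue((itemIds(j), s))
      else if (s > heap.head._2) { heap.dequeue(); heap.enqueue((itemIds(j), s)) }
      j += 1
    }
    // dequeueAll yields ascending scores under this ordering; reverse for descending.
    (userIds(i), heap.dequeueAll.reverse.toArray)
  }.toArray
{code}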

In our experiments with different configurations (different numbers of machines, 
different numbers of cores per machine), this solution is about a 5x-9x 
improvement over the current method when the blockSize is 4096. 

There is no OOM for this solution, and the performance is about the same 
across blockSizes. For the old method, the performance is highly 
dependent on blockSize.
cc [~mengxr]





was (Author: peng.m...@intel.com):
Thanks [~mlnick], I also compared the DataFrame version of ALS recommendForAll, but 
found no big performance improvement. 
In our solution: 
1. We use BLAS 2 to replace BLAS 3, which removes much of the GC caused by the BLAS 
computation.
2. We use output = Array[(Int, (Int, Double))](4096*topK) to replace output = 
Array[(Int, (Int, Double))](4096*4096), which greatly reduces the memory 
allocation. 
3. We use a priorityQueue for the sort, which improves it by about 40% compared with 
a general sort.

cc [~mengxr]




> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no method to set it)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp result.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize:     1024  2048  4096  8192
> Old method:    245s  332s  488s  OOM
> This solution: 121s  118s  117s  120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981148#comment-15981148
 ] 

Peng Meng commented on SPARK-20446:
---

Thanks [~mlnick], I also compared the DataFrame version of ALS recommendForAll, but 
found no big performance improvement. 
In our solution: 
1. We use BLAS 2 to replace BLAS 3, which removes much of the GC caused by the BLAS 
computation.
2. We use output = Array[(Int, (Int, Double))](4096*topK) to replace output = 
Array[(Int, (Int, Double))](4096*4096), which greatly reduces the memory 
allocation. 
3. We use a priorityQueue for the sort, which improves it by about 40% compared with 
a general sort.

cc [~mengxr]




> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no method to set it)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp result.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize:     1024  2048  4096  8192
> Old method:    245s  332s  488s  OOM
> This solution: 121s  118s  117s  120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981115#comment-15981115
 ] 

Peng Meng commented on SPARK-20446:
---

I think you mean https://github.com/apache/spark/pull/9980.
Maybe the problem is the same, but the solution is totally different.

PR 9980 still uses BLAS 3 in its solution. The GC pressure is caused by BLAS 3, so 
the PR 9980 method cannot solve the GC problem. 

I am very glad to know many people have studied this problem. 


> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no method to set it)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp result.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize:     1024  2048  4096  8192
> Old method:    245s  332s  488s  OOM
> This solution: 121s  118s  117s  120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-20446:
--
Description: 
The recommendForAll of MLLIB ALS is very slow.
GC is a key problem of the current method. 
The task uses the following code to keep the temp result:
val output = new Array[(Int, (Int, Double))](m*n)
m = n = 4096 (the default value; there is no method to set it)
so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, 
causes serious GC problems, and frequently OOMs.

Actually, we don't need to save all the temp results. Suppose we recommend topK 
(topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 
4 + 8) memory to save the temp result.
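To make the arithmetic concrete (an illustrative figure, with topK = 10): 4096 * 4096 * (4 + 4 + 8) bytes = 256 MB per task, versus 4096 * 10 * (4 + 4 + 8) bytes = 640 KB, roughly a 400x smaller temp buffer.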
I have written a solution for this method, with the following test results. 

The Test Environment:
3 workers: each worker has 10 cores, 30G memory, and 1 executor.
The Data: 480,000 users and 17,000 items

BlockSize:     1024  2048  4096  8192
Old method:    245s  332s  488s  OOM
This solution: 121s  118s  117s  120s

 

  was:
The recommendForAll of MLLIB ALS is very slow.
GC is a key problem of the current method. 
The task uses the following code to keep the temp result:
val output = new Array[(Int, (Int, Double))](m*n)
m = n = 4096 (the default value; there is no method to set it)
so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, 
causes serious GC problems, and frequently OOMs.

Actually, we don't need to save all the temp results. Suppose we recommend topK 
(topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 
4 + 8) memory to save the temp result.

 


> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (the default value; there is no method to set it)
> so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, 
> causes serious GC problems, and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend 
> topK (topK is about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp result.
> I have written a solution for this method, with the following test results. 
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items
> BlockSize:     1024  2048  4096  8192
> Old method:    245s  332s  488s  OOM
> This solution: 121s  118s  117s  120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Peng Meng (JIRA)
Peng Meng created SPARK-20446:
-

 Summary: Optimize the process of MLLIB ALS recommendForAll
 Key: SPARK-20446
 URL: https://issues.apache.org/jira/browse/SPARK-20446
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Peng Meng


The recommendForAll of MLLIB ALS is very slow.
GC is a key problem of the current method. 
The task uses the following code to keep the temp result:
val output = new Array[(Int, (Int, Double))](m*n)
m = n = 4096 (the default value; there is no method to set it)
so output is about 4k * 4k * (4 + 4 + 8) = 256M. This is a lot of memory, 
causes serious GC problems, and frequently OOMs.

Actually, we don't need to save all the temp results. Suppose we recommend topK 
(topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 
4 + 8) memory to save the temp result.

 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-04-24 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-20443:
--
Description: 
The blockSize of MLLIB ALS is very important for ALS performance. 
In our test, when the blockSize is 128, the performance is about 4X better than 
when the blockSize is 4096 (the default value).
The following are our test results: 
BlockSize(recommendationForAll time)
128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM)

The Test Environment:
3 workers: each worker has 10 cores, 30G memory, and 1 executor.
The Data: 480,000 users and 17,000 items

  was:
The blockSize of MLLIB ALS is very important for ALS performance. 
In our test, when the blockSize is 128, the performance is about 4X better than 
when the blockSize is 4096 (the default value).
The following are our test results: 
BlockSize(recommendationForAll time)
128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM)

The Test Environment:
3 workers: each worker has 10 cores, 30G memory, and 1 executor.
The Data: 480,000 users and 17,000 items


> The blockSize of MLLIB ALS should be setting  by the User
> -
>
> Key: SPARK-20443
> URL: https://issues.apache.org/jira/browse/SPARK-20443
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The blockSize of MLLIB ALS is very important for ALS performance. 
> In our test, when the blockSize is 128, the performance is about 4X better than 
> when the blockSize is 4096 (the default value).
> The following are our test results: 
> BlockSize(recommendationForAll time)
> 128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM)
> The Test Environment:
> 3 workers: each worker has 10 cores, 30G memory, and 1 executor.
> The Data: 480,000 users and 17,000 items



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-04-23 Thread Peng Meng (JIRA)
Peng Meng created SPARK-20443:
-

 Summary: The blockSize of MLLIB ALS should be setting  by the User
 Key: SPARK-20443
 URL: https://issues.apache.org/jira/browse/SPARK-20443
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Peng Meng
Priority: Minor


The blockSize of MLLIB ALS is very important for ALS performance. 
In our test, when the blockSize is 128, the performance is about 4X better than 
when the blockSize is 4096 (the default value).
The following are our test results: 
BlockSize(recommendationForAll time)
128(124s), 256(160s), 512(184s), 1024(244s), 2048(332s), 4096(488s), 8192(OOM)

The Test Environment:
3 workers: each worker has 10 cores, 30G memory, and 1 executor.
The Data: 480,000 users and 17,000 items



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-26 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15608322#comment-15608322
 ] 

Peng Meng commented on SPARK-18088:
---

I am neutral about changing the selectorType "KBest" to "numTopFeatures".
"KBest", "percentile", and "fpr" are the same as the selector names in sklearn.

Now, the selectorType => parameter pairs are:
kbest => numTopFeatures, percentile => percentile, fpr => alpha
After the change:
numTopFeatures => numTopFeatures, percentile => percentile, fpr => alpha
The sklearn pairs are:
kbest => k, percentile => percentile, fpr => alpha

Changing the parameter numTopFeatures to k may also be an option.

Hi [~srowen], what is your idea about this change?
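
For reference, a sketch of how each selectorType pairs with its parameter in the post-rename API (illustrative column names, not a statement about the final API shape):

{code:scala}
import org.apache.spark.ml.feature.ChiSqSelector

// Each selectorType reads its own parameter (post-rename pairing).
val byK = new ChiSqSelector()
  .setSelectorType("numTopFeatures").setNumTopFeatures(50)
  .setFeaturesCol("features").setLabelCol("label")
  .setOutputCol("selected")

val byPercentile = new ChiSqSelector().setSelectorType("percentile").setPercentile(0.1)
val byFpr = new ChiSqSelector().setSelectorType("fpr").setFpr(0.05)
{code}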

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605231#comment-15605231
 ] 

Peng Meng commented on SPARK-18088:
---

In the previous implementation, testing against only the statistic is not 
right. 
So I submitted https://issues.apache.org/jira/browse/SPARK-17870 to fix that bug. 
Testing against only the p-value is ok. 3 of the 5 feature selection methods in 
sklearn are based only on the p-value; the other two are based on the statistic. 
Because the degrees of freedom are the same when sklearn computes the chiSquare 
value, it can rank on the statistic. 
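
As a small illustration of p-value-only selection, a sketch of the SelectFpr rule (illustrative names, not sklearn's code):

{code:scala}
// FPR rule: select feature i iff its chi-squared p-value is below alpha.
def selectFpr(pValues: Array[Double], alpha: Double): Array[Int] =
  pValues.zipWithIndex.collect { case (p, i) if p < alpha => i }

// e.g. selectFpr(Array(0.001, 0.04, 0.3), alpha = 0.05) == Array(0, 1)
{code}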

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups
> One major item: FPR is not implemented correctly.  Testing against only the 
> p-value and not the test statistic does not really tell you anything.  We 
> should follow sklearn, which allows a p-value threshold for any selection 
> method: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html]
> * In this PR, I'm just going to remove FPR completely.  We can add it back in 
> a follow-up PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605211#comment-15605211
 ] 

Peng Meng commented on SPARK-18088:
---

Hi [~josephkb], I do not quite understand "Testing against only the p-value 
and not the test statistic does not really tell you anything." SelectFpr in 
sklearn is based only on the p-value. 

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups
> One major item: FPR is not implemented correctly.  Testing against only the 
> p-value and not the test statistic does not really tell you anything.  We 
> should follow sklearn, which allows a p-value threshold for any selection 
> method: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html]
> * In this PR, I'm just going to remove FPR completely.  We can add it back in 
> a follow-up PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18062) ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return probabilities when given all-0 vector

2016-10-23 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600926#comment-15600926
 ] 

Peng Meng commented on SPARK-18062:
---

This relates to how to interpret an all-0 rawPrediction: are all classes 
impossible, or are all classes equally likely? If the latter, 
ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return 
a valid probability vector with the uniform distribution.
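
A minimal sketch of the proposed behavior (illustrative, not the actual Spark method):

{code:scala}
// Normalize raw scores to probabilities in place; treat an all-0 input as
// "all classes equally likely" and return the uniform distribution.
def normalizeToProbabilitiesInPlace(raw: Array[Double]): Unit = {
  val sum = raw.sum
  if (sum == 0.0) java.util.Arrays.fill(raw, 1.0 / raw.length)
  else {
    var i = 0
    while (i < raw.length) { raw(i) /= sum; i += 1 }
  }
}
{code}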


> ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should 
> return probabilities when given all-0 vector
> 
>
> Key: SPARK-18062
> URL: https://issues.apache.org/jira/browse/SPARK-18062
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> {{ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace}} returns 
> a vector of all-0 when given a rawPrediction vector of all-0.  It should 
> return a valid probability vector with the uniform distribution.
> Note: This will be a *behavior change* but it should be very minor and affect 
> few if any users.  But we should note it in release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567170#comment-15567170
 ] 

Peng Meng commented on SPARK-17870:
---

hi [~avulanov], the question here is not whether to use raw chi2 scores or p-values; the 
question is that if we use raw chi2 scores, the DoF should be the same. 
"The chi2-test is used multiple times" is another problem. According to 
http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html, "whenever 
a statistical test is used multiple times, then the probability of getting at 
least one error increases." This problem is partially addressed by selecting the 
p-values corresponding to the family-wise error rate (SelectFwe, SPARK-17645). 
Thanks very much.

Hi [~srowen], I totally agree with your comments. Since the DoF differs across 
features in the Spark ChiSquare values, we can use the p-values for Spark's SelectKBest and 
SelectPercentile. Thanks very much.

I will submit a PR for this.

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector uses the 
> ChiSquareTestResult.statistic (ChiSquare value) to select features: it 
> selects the features with the largest ChiSquare values. But the Degrees of 
> Freedom (df) of the ChiSquare values differ in Statistics.chiSqTest(RDD), and 
> with different df you cannot select features based on the ChiSquare value.
> Because of this wrong way of computing the ChiSquare value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest to select: feature 3 will be selected.
> If we use selectFpr to select: features 1 and 2 will be selected. 
> This is strange. 
> I used scikit-learn to test the same data with the same parameters. 
> When using selectKBest to select: feature 1 will be selected. 
> When using selectFpr to select: features 1 and 2 will be selected. 
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-17870:
--
Summary: ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is 
wrong   (was: ML/MLLIB: Statistics.chiSqTest(RDD) is wrong )

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector uses the 
> ChiSquareTestResult.statistic (ChiSquare value) to select features: it 
> selects the features with the largest ChiSquare values. But the Degrees of 
> Freedom (df) of the ChiSquare values differ in Statistics.chiSqTest(RDD), and 
> with different df you cannot select features based on the ChiSquare value.
> Because of this wrong way of computing the ChiSquare value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest to select: feature 3 will be selected.
> If we use selectFpr to select: features 1 and 2 will be selected. 
> This is strange. 
> I used scikit-learn to test the same data with the same parameters. 
> When using selectKBest to select: feature 1 will be selected. 
> When using selectFpr to select: features 1 and 2 will be selected. 
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565315#comment-15565315
 ] 

Peng Meng commented on SPARK-17870:
---

https://github.com/apache/spark/pull/1484#issuecomment-51024568
Hi [~mengxr] and [~avulanov], what do you think about this JIRA? 

> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector uses the 
> ChiSquareTestResult.statistic (ChiSquare value) to select features: it 
> selects the features with the largest ChiSquare values. But the Degrees of 
> Freedom (df) of the ChiSquare values differ in Statistics.chiSqTest(RDD), and 
> with different df you cannot select features based on the ChiSquare value.
> Because of this wrong way of computing the ChiSquare value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest to select: feature 3 will be selected.
> If we use selectFpr to select: features 1 and 2 will be selected. 
> This is strange. 
> I used scikit-learn to test the same data with the same parameters. 
> When using selectKBest to select: feature 1 will be selected. 
> When using selectFpr to select: features 1 and 2 will be selected. 
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565251#comment-15565251
 ] 

Peng Meng commented on SPARK-17870:
---

The scikit-learn code is here: 
https://github.com/scikit-learn/scikit-learn/blob/412996f09b6756752dfd3736c306d46fca8f1aa1/sklearn/feature_selection/univariate_selection.py,
line 422 for SelectKBest; the chiSquare computation is also in the same file.

For the last example, each row of X is a sample containing three features; 
there are 4 samples in total. y is the label vector.
Thanks very much.  


> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector uses the 
> ChiSquareTestResult.statistic (ChiSquare value) to select features: it 
> selects the features with the largest ChiSquare values. But the Degrees of 
> Freedom (df) of the ChiSquare values differ in Statistics.chiSqTest(RDD), and 
> with different df you cannot select features based on the ChiSquare value.
> Because of this wrong way of computing the ChiSquare value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest to select: feature 3 will be selected.
> If we use selectFpr to select: features 1 and 2 will be selected. 
> This is strange. 
> I used scikit-learn to test the same data with the same parameters. 
> When using selectKBest to select: feature 1 will be selected. 
> When using selectFpr to select: features 1 and 2 will be selected. 
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565225#comment-15565225
 ] 

Peng Meng commented on SPARK-17870:
---

Yes, SelectKBest and SelectPercentile in scikit-learn use only the statistic.
Because its method of computing the ChiSquare value is different, the DoF of all 
features in scikit-learn are the same, so it can do that.

The ChiSquare value computation process is like this:
Suppose we have data (the test suite data of ml/feature/ChiSqSelectorSuite.scala):
X = [ 8  7  0
      0  9  6
      0  9  8
      8  9  5 ]
y = [0 1 1 2]^T
scikit-learn computes the chiSquare values like this:
first, one-hot encode y:
Y = [ 1  0  0
      0  1  0
      0  1  0
      0  0  1 ]
observed = Y' * X =
[ 8   7   0
  0  18  14
  8   9   5 ]
expected =
[ 4  8.5  4.75
  8  17   9.5
  4  8.5  4.75 ]
_chisquare(observed, expected) then computes the ChiSquare value of every feature, 
and we can see that the DF of each feature is the same.
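
To make this concrete, a small sketch in plain Scala (illustrative only, not sklearn's or Spark's code) that reproduces observed and expected for the data above:

{code:scala}
// sklearn-style chi2 contingency for the example above.
val X = Array(Array(8.0, 7.0, 0.0), Array(0.0, 9.0, 6.0),
              Array(0.0, 9.0, 8.0), Array(8.0, 9.0, 5.0))
val y = Array(0, 1, 1, 2)
val numClasses = 3
val numFeatures = X(0).length

// observed(c)(f): sum of feature f over the samples of class c (= Y' * X).
val observed = Array.ofDim[Double](numClasses, numFeatures)
for (i <- X.indices; f <- 0 until numFeatures) observed(y(i))(f) += X(i)(f)

// expected(c)(f) = P(class c) * total count of feature f.
val classProb = y.groupBy(identity).map { case (c, s) => c -> s.length.toDouble / y.length }
val featureCount = X.transpose.map(_.sum)
val expected = Array.tabulate(numClasses, numFeatures)((c, f) => classProb(c) * featureCount(f))

// Every feature's statistic is computed against the same DoF (numClasses - 1).
val chi2 = (0 until numFeatures).map { f =>
  (0 until numClasses).map(c => math.pow(observed(c)(f) - expected(c)(f), 2) / expected(c)(f)).sum
}
{code}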

But for Spark's Statistics.chiSqTest(RDD), another method is used: a contingency 
table is constructed for each feature. So the DF is different for each 
feature.  

For "gives different results from ranking on the statistic", this is because 
the parameters different.
For previous example, if use SelectKBest(2), the selected feature is the same 
as SelectFpr(0.2) in scikit learn


 


> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector uses the 
> ChiSquareTestResult.statistic (ChiSquare value) to select features: it 
> selects the features with the largest ChiSquare values. But the Degrees of 
> Freedom (df) of the ChiSquare values differ in Statistics.chiSqTest(RDD), and 
> with different df you cannot select features based on the ChiSquare value.
> Because of this wrong way of computing the ChiSquare value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest to select: feature 3 will be selected.
> If we use selectFpr to select: features 1 and 2 will be selected. 
> This is strange. 
> I used scikit-learn to test the same data with the same parameters. 
> When using selectKBest to select: feature 1 will be selected. 
> When using selectFpr to select: features 1 and 2 will be selected. 
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565041#comment-15565041
 ] 

Peng Meng commented on SPARK-17870:
---

hi [~srowen], thanks very much for your quick reply. 
Yes, the p-value is better than the raw statistic in this case, because the p-value is 
computed from the DoF and the raw statistic.
The raw statistic is also popular for feature selection: SelectKBest and 
SelectPercentile in scikit-learn are based on the raw statistic. 
The question here is that we should use the same DoF, like scikit-learn, to compute the 
ChiSquare value. 
For this JIRA, I propose to change the method of computing the ChiSquare value to 
what is done in scikit-learn (i.e. change Statistics.chiSqTest(RDD)). 

Thanks very much.  

> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector uses the 
> ChiSquareTestResult.statistic (ChiSquare value) to select features: it 
> selects the features with the largest ChiSquare values. But the Degrees of 
> Freedom (df) of the ChiSquare values differ in Statistics.chiSqTest(RDD), and 
> with different df you cannot select features based on the ChiSquare value.
> Because of this wrong way of computing the ChiSquare value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest to select: feature 3 will be selected.
> If we use selectFpr to select: features 1 and 2 will be selected. 
> This is strange. 
> I used scikit-learn to test the same data with the same parameters. 
> When using selectKBest to select: feature 1 will be selected. 
> When using selectFpr to select: features 1 and 2 will be selected. 
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)
Peng Meng created SPARK-17870:
-

 Summary: ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
 Key: SPARK-17870
 URL: https://issues.apache.org/jira/browse/SPARK-17870
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Peng Meng
Priority: Critical


The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala 
(line 233) is wrong.

The feature selection method ChiSquareSelector uses the 
ChiSquareTestResult.statistic (ChiSquare value) to select features: it 
selects the features with the largest ChiSquare values. But the Degrees of 
Freedom (df) of the ChiSquare values differ in Statistics.chiSqTest(RDD), and 
with different df you cannot select features based on the ChiSquare value.

Because of this wrong way of computing the ChiSquare value, the feature selection 
results are strange.
Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
If we use selectKBest to select: feature 3 will be selected.
If we use selectFpr to select: features 1 and 2 will be selected. 
This is strange. 

I used scikit-learn to test the same data with the same parameters. 
When using selectKBest to select: feature 1 will be selected. 
When using selectFpr to select: features 1 and 2 will be selected. 
This result makes sense, because the df of each feature in scikit-learn is 
the same.

I plan to submit a PR for this problem.
 

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17645) Add feature selector methods based on: False Discovery Rate (FDR) and Family Wise Error rate (FWE)

2016-09-23 Thread Peng Meng (JIRA)
Peng Meng created SPARK-17645:
-

 Summary: Add feature selector methods based on: False Discovery 
Rate (FDR) and Family Wise Error rate (FWE)
 Key: SPARK-17645
 URL: https://issues.apache.org/jira/browse/SPARK-17645
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peng Meng
Priority: Minor


Univariate feature selection works by selecting the best features based on 
univariate statistical tests. 
FDR and FWE are popular univariate statistical tests for feature selection.

In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 
25 most-cited statistical papers. FDR uses the Benjamini-Hochberg procedure 
in this PR: https://en.wikipedia.org/wiki/False_discovery_rate. 
In statistics, FWE is the probability of making one or more false discoveries, 
or type I errors, among all the hypotheses when performing multiple hypothesis 
tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add FDR and FWE methods for ChiSqSelector in this PR, as implemented 
in scikit-learn. 
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
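
As a sketch of the Benjamini-Hochberg selection rule used for FDR (illustrative only, not the PR's code): sort the p-values ascending, find the largest rank i with p(i) <= alpha * i / n, and select every feature at or below that rank.

{code:scala}
// Sketch: Benjamini-Hochberg FDR selection over per-feature p-values.
def selectFdr(pValues: Array[Double], alpha: Double): Array[Int] = {
  val n = pValues.length
  val byP = pValues.zipWithIndex.sortBy(_._1)       // ascending p-values
  // Largest 1-based rank i with p(i) <= alpha * i / n.
  val cutoff = byP.zipWithIndex.lastIndexWhere {
    case ((p, _), rank) => p <= alpha * (rank + 1) / n
  }
  if (cutoff < 0) Array.empty[Int]
  else byP.take(cutoff + 1).map(_._2).sorted        // selected feature indices
}

// FWE (Bonferroni-style threshold, as in sklearn's SelectFwe): p < alpha / n.
def selectFwe(pValues: Array[Double], alpha: Double): Array[Int] =
  pValues.zipWithIndex.collect { case (p, i) if p < alpha / pValues.length => i }
{code}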



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17505) Add setBins for BinaryClassificationMetrics in mlllb/evaluation

2016-09-12 Thread Peng Meng (JIRA)
Peng Meng created SPARK-17505:
-

 Summary: Add setBins for BinaryClassificationMetrics in 
mlllb/evaluation
 Key: SPARK-17505
 URL: https://issues.apache.org/jira/browse/SPARK-17505
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Peng Meng
Priority: Minor


Add a setBins method for BinaryClassificationMetrics.
BinaryClassificationMetrics is a class in mllib/evaluation, and numBins is a key 
attribute of it. If numBins is greater than 0, then the curves (ROC curve, PR 
curve) computed internally will be down-sampled to this many "bins". 
It is useful to let the user set numBins.
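
Today numBins can only be supplied at construction time; this ticket proposes a setter. A sketch of both (setBins is the proposed, not yet existing, method; scoreAndLabels is an assumed input):

{code:scala}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// scoreAndLabels: RDD[(Double, Double)] of (score, label) pairs, assumed defined.
def downSampledRoc(scoreAndLabels: RDD[(Double, Double)]): RDD[(Double, Double)] = {
  // Existing API: numBins is fixed when the metrics object is constructed.
  val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 100)
  // Proposed in this ticket (hypothetical): metrics.setBins(100)
  metrics.roc()  // curve down-sampled to roughly numBins points
}
{code}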



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17462) Check for places within MLlib which should use VersionUtils to parse Spark version strings

2016-09-12 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483336#comment-15483336
 ] 

Peng Meng commented on SPARK-17462:
---

hi [~josephkb], I am busy these days; I am glad VinceShieh can help on this. 

> Check for places within MLlib which should use VersionUtils to parse Spark 
> version strings
> --
>
> Key: SPARK-17462
> URL: https://issues.apache.org/jira/browse/SPARK-17462
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> [SPARK-17456] creates a utility for parsing Spark versions.  Several places 
> in MLlib use custom regexes or other approaches.  Those should be fixed to 
> use the new (tested) utility.  This task is for checking and replacing those 
> instances.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17462) Check for places within MLlib which should use VersionUtils to parse Spark version strings

2016-09-09 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477019#comment-15477019
 ] 

Peng Meng commented on SPARK-17462:
---

Hi [~josephkb], will you work on this? If not, I can work on it. Thanks.

> Check for places within MLlib which should use VersionUtils to parse Spark 
> version strings
> --
>
> Key: SPARK-17462
> URL: https://issues.apache.org/jira/browse/SPARK-17462
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> [SPARK-17456] creates a utility for parsing Spark versions.  Several places 
> in MLlib use custom regexes or other approaches.  Those should be fixed to 
> use the new (tested) utility.  This task is for checking and replacing those 
> instances.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-09-08 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475860#comment-15475860
 ] 

Peng Meng commented on SPARK-6160:
--

hi [~GayathriMurali], are you still working on this? If not, I can work on it. 
Thanks.

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-09-08 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475769#comment-15475769
 ] 

Peng Meng commented on SPARK-6160:
--

Hi [~josephkb], I had some discussion with [~srowen] about keeping the test 
statistic info of ChiSqSelector in PR 
https://github.com/apache/spark/pull/14597. 
Can you review that PR, and this commit: 
https://github.com/apache/spark/pull/14597/commits/3d6aecb8441503c9c3d62a2d8a3d48824b9d6637


> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


