[jira] [Created] (SPARK-32005) Add aggregate functions for computing percentiles on weighted data
Devesh Parekh created SPARK-32005:
-------------------------------------

             Summary: Add aggregate functions for computing percentiles on weighted data
                 Key: SPARK-32005
                 URL: https://issues.apache.org/jira/browse/SPARK-32005
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.0.0, 2.4.6
            Reporter: Devesh Parekh

SPARK-30569 adds percentile_approx functions for computing percentiles for a column with equal weights. It would be useful to have variants that also take a weight column.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
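The requested behavior can be sketched outside Spark. Below is a minimal, illustrative weighted-percentile computation in plain Python, using the inverse-CDF definition over sorted values; the helper name `weighted_percentile` and this particular definition are assumptions for illustration, not Spark's `percentile_approx` implementation:

```python
def weighted_percentile(values, weights, p):
    """Return the smallest value whose cumulative weight fraction reaches p (0 < p <= 1).

    With all weights equal this reduces to an ordinary (lower) percentile;
    a weight column lets heavy rows pull the percentile toward their value.
    """
    pairs = sorted(zip(values, weights))       # sort by value, carrying weights along
    total = float(sum(weights))
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative / total >= p:            # first value covering fraction p of mass
            return value
    return pairs[-1][0]                        # fallback for floating-point round-off
```

For example, the median of [1, 2] is 2 when the second row carries weight 3, since that row alone holds three quarters of the total mass.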
[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15786390#comment-15786390 ]

Devesh Parekh commented on SPARK-18693:
---------------------------------------

I suggest this is more appropriately classified as a bug rather than an improvement. Users who follow the documentation and use CrossValidator for model selection with these evaluators on weighted input will get wrong results. At the very least, the documentation should warn that results will be wrong when a weight-aware model is fit on weighted input and these existing evaluators are used in CrossValidator. With that warning in place, making the evaluators work on weighted input would then be an improvement.

> BinaryClassificationEvaluator, RegressionEvaluator, and
> MulticlassClassificationEvaluator should use sample weight data
> -----------------------------------------------------------------
>
>                 Key: SPARK-18693
>                 URL: https://issues.apache.org/jira/browse/SPARK-18693
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.0.2
>            Reporter: Devesh Parekh
>
> The LogisticRegression and LinearRegression models support training with a
> weight column, but the corresponding evaluators do not support computing
> metrics using those weights. This breaks model selection using CrossValidator.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devesh Parekh updated SPARK-18693:
----------------------------------

    Description: The LogisticRegression and LinearRegression models support training with a weight column, but the corresponding evaluators do not support computing metrics using those weights. This breaks model selection using CrossValidator.  (was: The LogisticRegression and LinearRegression models support training with a weight column, but the corresponding evaluators do not support computing metrics using those weights.)

> BinaryClassificationEvaluator, RegressionEvaluator, and
> MulticlassClassificationEvaluator should use sample weight data
> -----------------------------------------------------------------
>
>                 Key: SPARK-18693
>                 URL: https://issues.apache.org/jira/browse/SPARK-18693
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.2
>            Reporter: Devesh Parekh
>
> The LogisticRegression and LinearRegression models support training with a
> weight column, but the corresponding evaluators do not support computing
> metrics using those weights. This breaks model selection using CrossValidator.
[jira] [Created] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
Devesh Parekh created SPARK-18693:
----------------------------------

             Summary: BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
                 Key: SPARK-18693
                 URL: https://issues.apache.org/jira/browse/SPARK-18693
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.0.2
            Reporter: Devesh Parekh

The LogisticRegression and LinearRegression models support training with a weight column, but the corresponding evaluators do not support computing metrics using those weights.
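To illustrate why an evaluator that ignores weights breaks model selection, here is a toy sketch in plain Python (the `accuracy` helper is hypothetical, not Spark's evaluator code): the unweighted and weighted versions of the same metric can score the same predictions very differently, so CrossValidator can pick the wrong model.

```python
def accuracy(labels, predictions, weights=None):
    """Fraction of correctly classified mass.

    With weights=None every row counts equally, which is what the current
    evaluators do even when the model was trained with a weight column.
    """
    if weights is None:
        weights = [1.0] * len(labels)          # equal weighting fallback
    correct = sum(w for y, yhat, w in zip(labels, predictions, weights) if y == yhat)
    return correct / sum(weights)
```

With labels [1, 0, 0], predictions [1, 1, 0], and weights [10, 1, 1], the unweighted accuracy is 2/3 while the weighted accuracy is 11/12; a model-selection loop using the unweighted number optimizes the wrong quantity.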
[jira] [Comment Edited] (SPARK-2505) Weighted Regularizer
[ https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495170#comment-14495170 ]

Devesh Parekh edited comment on SPARK-2505 at 4/14/15 11:57 PM:
----------------------------------------------------------------

Can you describe a case where you would want the weights to be anything other than 0 for the intercept and lambda for everything else? The unregularized intercept use case comes up very often, so the API for this case should be very simple.

was (Author: dparekh):
Can you describe a case where you would want the weights to be anything other than 0 for the intercept and lambda for everything else?

> Weighted Regularizer
> --------------------
>
>                 Key: SPARK-2505
>                 URL: https://issues.apache.org/jira/browse/SPARK-2505
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: DB Tsai
>
> The current implementation of regularization in the linear models uses
> `Updater`, and this design has a couple of issues:
> 1) It penalizes all the weights, including the intercept. In machine learning
> training, people typically don't penalize the intercept.
> 2) The `Updater` also contains the adaptive step size logic for gradient descent;
> we would like to clean this up by separating the regularization logic out of the
> updater into a regularizer, so that the LBFGS optimizer doesn't need the trick
> for getting the loss and gradient of the objective function.
> In this work, a weighted regularizer will be implemented, and users can
> exclude the intercept or any weight from regularization by setting that term's
> penalty weight to zero. Since the regularizer will return a tuple of loss and
> gradient, the adaptive step size logic and the soft thresholding for L1 in
> Updater will be moved to the SGD optimizer.
[jira] [Commented] (SPARK-2505) Weighted Regularizer
[ https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495170#comment-14495170 ]

Devesh Parekh commented on SPARK-2505:
--------------------------------------

Can you describe a case where you would want the weights to be anything other than 0 for the intercept and lambda for everything else?

> Weighted Regularizer
> --------------------
>
>                 Key: SPARK-2505
>                 URL: https://issues.apache.org/jira/browse/SPARK-2505
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: DB Tsai
>
> The current implementation of regularization in the linear models uses
> `Updater`, and this design has a couple of issues:
> 1) It penalizes all the weights, including the intercept. In machine learning
> training, people typically don't penalize the intercept.
> 2) The `Updater` also contains the adaptive step size logic for gradient descent;
> we would like to clean this up by separating the regularization logic out of the
> updater into a regularizer, so that the LBFGS optimizer doesn't need the trick
> for getting the loss and gradient of the objective function.
> In this work, a weighted regularizer will be implemented, and users can
> exclude the intercept or any weight from regularization by setting that term's
> penalty weight to zero. Since the regularizer will return a tuple of loss and
> gradient, the adaptive step size logic and the soft thresholding for L1 in
> Updater will be moved to the SGD optimizer.
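A minimal sketch of the weighted-regularizer idea discussed above, in plain Python rather than Spark's Scala internals (`weighted_l2` is a hypothetical helper, not the proposed API): each coefficient gets its own penalty weight, and setting the intercept's penalty to zero leaves it unregularized, which is the common case the comment refers to.

```python
def weighted_l2(coefficients, penalties):
    """Loss and gradient of the per-term weighted L2 penalty:

        loss = sum_i penalties[i] * coefficients[i]**2 / 2
        grad_i = penalties[i] * coefficients[i]

    A zero entry in `penalties` (e.g. for the intercept) excludes that
    term from regularization entirely.
    """
    loss = sum(p * c * c / 2.0 for c, p in zip(coefficients, penalties))
    gradient = [p * c for c, p in zip(coefficients, penalties)]
    return loss, gradient
```

With coefficients [3.0, 2.0, 1.0] and penalties [0.0, 0.1, 0.1] (intercept first), the intercept contributes nothing to either the loss or the gradient, matching the "0 for the intercept, lambda for everything else" case.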
[jira] [Created] (SPARK-6162) Handle missing values in GBM
Devesh Parekh created SPARK-6162:
---------------------------------

             Summary: Handle missing values in GBM
                 Key: SPARK-6162
                 URL: https://issues.apache.org/jira/browse/SPARK-6162
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.2.1
            Reporter: Devesh Parekh

We build a lot of predictive models over data combined from multiple sources, where some entries may not have all sources of data, so some values are missing in each feature vector. Another place this can come up is with features from slightly heterogeneous items (or items composed of heterogeneous subcomponents) that share many features in common but may have extra features for different types, where you don't want to manually train models for every different type.

R's GBM library, which is what we are currently using, deals with this type of data nicely by making "missing" nodes in the decision tree (a surrogate split) for features that can have missing values. We'd like to do the same with MLlib, but LabeledPoint would need to support missing values, and GradientBoostedTrees would need to be modified to deal with them.
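As a sketch of the "missing node" idea described above, in plain Python (the `route` helper is hypothetical, not MLlib's split logic): each split can carry a default direction for missing values, encoded here as NaN, so rows with an absent feature still flow down the tree instead of being dropped.

```python
import math

def route(feature_value, threshold, missing_goes_left):
    """Decide which branch a row takes at one decision-tree split.

    NaN encodes a missing feature value; instead of failing, the split
    sends it down a pre-chosen default branch (the "missing node").
    """
    if math.isnan(feature_value):
        return "left" if missing_goes_left else "right"
    return "left" if feature_value <= threshold else "right"
```

A learner would pick `missing_goes_left` per split during training (e.g. whichever direction reduces loss more on rows with that feature missing); prediction then needs no imputation.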
[jira] [Commented] (SPARK-5809) OutOfMemoryError in logDebug in RandomForest.scala
[ https://issues.apache.org/jira/browse/SPARK-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324862#comment-14324862 ]

Devesh Parekh commented on SPARK-5809:
--------------------------------------

This was a naive run of GBM on TFIDF vectors produced by HashingTF (https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF), which creates 2^20 features (more than a million). What is the maximum number of features that GradientBoostedTrees will work for? I'll do a dimensionality reduction before trying again.

> OutOfMemoryError in logDebug in RandomForest.scala
> --------------------------------------------------
>
>                 Key: SPARK-5809
>                 URL: https://issues.apache.org/jira/browse/SPARK-5809
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Devesh Parekh
>            Assignee: Joseph K. Bradley
>            Priority: Minor
>              Labels: easyfix
>
> When training a GBM on sparse vectors produced by HashingTF, I get the
> following OutOfMemoryError, where RandomForest is building a debug string to
> log.
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:3326)
>         at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
>         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
>         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
>         at java.lang.StringBuilder.append(StringBuilder.java:136)
>         at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
>         at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>         at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
>         at scala.collection.AbstractTraversable.addString(Traversable.scala:105)
>         at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
>         at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
>         at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
>         at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
>         at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
>         at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
>         at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
>         at org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67)
>         at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150)
>         at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64)
>         at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
>         at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
>         at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
>
> A workaround until this is fixed is to modify log4j.properties in the conf
> directory to filter out debug logs in RandomForest. For example:
>
> log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN
[jira] [Created] (SPARK-5809) OutOfMemoryError in logDebug in RandomForest.scala
Devesh Parekh created SPARK-5809:
---------------------------------

             Summary: OutOfMemoryError in logDebug in RandomForest.scala
                 Key: SPARK-5809
                 URL: https://issues.apache.org/jira/browse/SPARK-5809
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.2.0
            Reporter: Devesh Parekh

When training a GBM on sparse vectors produced by HashingTF, I get the following OutOfMemoryError, where RandomForest is building a debug string to log.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3326)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
        at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
        at scala.collection.AbstractTraversable.addString(Traversable.scala:105)
        at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
        at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
        at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
        at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
        at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
        at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
        at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
        at org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67)
        at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150)
        at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64)
        at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
        at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
        at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)

A workaround until this is fixed is to modify log4j.properties in the conf directory to filter out debug logs in RandomForest. For example:

log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN