[jira] [Created] (SPARK-32005) Add aggregate functions for computing percentiles on weighted data

2020-06-16 Thread Devesh Parekh (Jira)
Devesh Parekh created SPARK-32005:
-

 Summary: Add aggregate functions for computing percentiles on 
weighted data
 Key: SPARK-32005
 URL: https://issues.apache.org/jira/browse/SPARK-32005
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0, 2.4.6
Reporter: Devesh Parekh


SPARK-30569 adds percentile_approx functions for computing percentiles for a 
column with equal weights. It would be useful to have variants that also take a 
weight column.
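
Until such variants exist, a weighted percentile can be computed by hand from a
running total of the weights. Below is a minimal sketch (not a proposed API),
assuming a DataFrame with hypothetical "value" and "weight" columns and using
the lower weighted-percentile definition: the smallest value whose cumulative
weight reaches p of the total weight.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("weighted-percentile").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical weighted data: (value, weight)
val df = Seq((1.0, 1.0), (2.0, 3.0), (3.0, 1.0), (10.0, 5.0)).toDF("value", "weight")
val p = 0.5 // target percentile

val total = df.agg(sum($"weight")).first().getDouble(0)
// Running total of weight, in ascending order of value.
val cum = Window.orderBy($"value").rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Smallest value whose cumulative weight reaches p * total (here: 3.0).
val percentile = df
  .withColumn("cumWeight", sum($"weight").over(cum))
  .filter($"cumWeight" >= p * total)
  .orderBy($"value")
  .select($"value")
  .first()
  .getDouble(0)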






[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data

2016-12-29 Thread Devesh Parekh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15786390#comment-15786390
 ] 

Devesh Parekh commented on SPARK-18693:
---

I suggest this is more appropriately classified as a bug than an improvement.
Users who follow the documentation and use CrossValidator for model selection
with these evaluators on weighted input will get wrong results. At the very
least, the documentation should warn that results will be wrong when a
weight-aware model is fit on weighted input and evaluated with these existing
evaluators in CrossValidator. With that warning in place, making the
evaluators work on weighted input would then be an improvement.
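
In the meantime, the weighted metric is easy to compute by hand from the
transformed predictions. A minimal sketch of a weighted RMSE (what a
weight-aware RegressionEvaluator would compute internally), assuming the usual
"label", "prediction", and "weight" column names:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Weighted RMSE: sqrt(sum(w * (prediction - label)^2) / sum(w)).
def weightedRmse(predictions: DataFrame): Double = {
  val row = predictions.agg(
    sum(col("weight") * pow(col("prediction") - col("label"), 2)).as("wsse"),
    sum(col("weight")).as("wsum")
  ).first()
  math.sqrt(row.getDouble(0) / row.getDouble(1))
}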

> BinaryClassificationEvaluator, RegressionEvaluator, and 
> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-18693
> URL: https://issues.apache.org/jira/browse/SPARK-18693
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Devesh Parekh
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.






[jira] [Updated] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data

2016-12-29 Thread Devesh Parekh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devesh Parekh updated SPARK-18693:
--
Description: The LogisticRegression and LinearRegression models support 
training with a weight column, but the corresponding evaluators do not support 
computing metrics using those weights. This breaks model selection using 
CrossValidator.  (was: The LogisticRegression and LinearRegression models 
support training with a weight column, but the corresponding evaluators do not 
support computing metrics using those weights.)

> BinaryClassificationEvaluator, RegressionEvaluator, and 
> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-18693
> URL: https://issues.apache.org/jira/browse/SPARK-18693
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Devesh Parekh
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.






[jira] [Created] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data

2016-12-02 Thread Devesh Parekh (JIRA)
Devesh Parekh created SPARK-18693:
-

 Summary: BinaryClassificationEvaluator, RegressionEvaluator, and 
MulticlassClassificationEvaluator should use sample weight data
 Key: SPARK-18693
 URL: https://issues.apache.org/jira/browse/SPARK-18693
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.2
Reporter: Devesh Parekh


The LogisticRegression and LinearRegression models support training with a 
weight column, but the corresponding evaluators do not support computing 
metrics using those weights.
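
To make the asymmetry concrete, a sketch against the 2.0.x API (the column
names are the usual defaults, assumed here): the estimator accepts a weight
column, but the evaluator has no corresponding parameter, so every metric it
reports treats rows as equally weighted.

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator

// The estimator honors per-row weights...
val lr = new LinearRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setWeightCol("weight")

// ...but RegressionEvaluator has no setWeightCol, so the RMSE it computes (and
// any CrossValidator decision based on it) ignores the weights entirely.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")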






[jira] [Comment Edited] (SPARK-2505) Weighted Regularizer

2015-04-14 Thread Devesh Parekh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495170#comment-14495170
 ] 

Devesh Parekh edited comment on SPARK-2505 at 4/14/15 11:57 PM:


Can you describe a case where you would want the weights to be anything other
than 0 for the intercept and lambda for everything else? The
unregularized-intercept use case comes up very often, so the API for that case
should be very simple.


was (Author: dparekh):
Can you describe a case where you would want the weights to be anything other
than 0 for the intercept and lambda for everything else?

> Weighted Regularizer
> 
>
> Key: SPARK-2505
> URL: https://issues.apache.org/jira/browse/SPARK-2505
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: DB Tsai
>
> The current implementation of regularization in the linear models uses 
> `Updater`, and this design has a couple of issues:
> 1) It penalizes all the weights, including the intercept. In machine learning 
> training, people typically don't penalize the intercept.
> 2) The `Updater` also contains the adaptive step size logic for gradient 
> descent. We would like to separate the regularization logic out of the updater 
> into a regularizer, so that the LBFGS optimizer doesn't need the trick for 
> getting the loss and gradient of the objective function.
> In this work, a weighted regularizer will be implemented, and users can 
> exclude the intercept, or any weight, from regularization by setting that 
> term's penalty to zero. Since the regularizer will return a tuple of loss and 
> gradient, the adaptive step size logic and the soft thresholding for L1 in 
> Updater will be moved to the SGD optimizer.
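
To illustrate the shape being proposed, here is a sketch (not the actual Spark
implementation) of a per-term L2 penalty that returns the (loss, gradient)
tuple described above; entry 0 of the penalty vector stands for the intercept.

// loss = 0.5 * sum_i penalty(i) * w_i^2, gradient_i = penalty(i) * w_i.
// Setting penalty(i) = 0.0 (e.g. for the intercept) excludes that term.
def weightedL2(coefficients: Array[Double], penalty: Array[Double]): (Double, Array[Double]) = {
  require(coefficients.length == penalty.length, "one penalty per coefficient")
  val loss = 0.5 * coefficients.indices.map(i => penalty(i) * coefficients(i) * coefficients(i)).sum
  val gradient = coefficients.indices.map(i => penalty(i) * coefficients(i)).toArray
  (loss, gradient)
}

// The common case from the comment above: 0 for the intercept, lambda for the
// remaining numFeatures coefficients.
def standardPenalty(numFeatures: Int, lambda: Double): Array[Double] =
  0.0 +: Array.fill(numFeatures)(lambda)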






[jira] [Commented] (SPARK-2505) Weighted Regularizer

2015-04-14 Thread Devesh Parekh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495170#comment-14495170
 ] 

Devesh Parekh commented on SPARK-2505:
--

Can you describe a case where you would want the weights to be anything other
than 0 for the intercept and lambda for everything else?

> Weighted Regularizer
> 
>
> Key: SPARK-2505
> URL: https://issues.apache.org/jira/browse/SPARK-2505
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: DB Tsai
>
> The current implementation of regularization in the linear models uses 
> `Updater`, and this design has a couple of issues:
> 1) It penalizes all the weights, including the intercept. In machine learning 
> training, people typically don't penalize the intercept.
> 2) The `Updater` also contains the adaptive step size logic for gradient 
> descent. We would like to separate the regularization logic out of the updater 
> into a regularizer, so that the LBFGS optimizer doesn't need the trick for 
> getting the loss and gradient of the objective function.
> In this work, a weighted regularizer will be implemented, and users can 
> exclude the intercept, or any weight, from regularization by setting that 
> term's penalty to zero. Since the regularizer will return a tuple of loss and 
> gradient, the adaptive step size logic and the soft thresholding for L1 in 
> Updater will be moved to the SGD optimizer.






[jira] [Created] (SPARK-6162) Handle missing values in GBM

2015-03-04 Thread Devesh Parekh (JIRA)
Devesh Parekh created SPARK-6162:


 Summary: Handle missing values in GBM
 Key: SPARK-6162
 URL: https://issues.apache.org/jira/browse/SPARK-6162
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: Devesh Parekh


We build many predictive models over data combined from multiple sources,
where some entries may not have every source of data, so some values are
missing from each feature vector. This also comes up when you have features
from slightly heterogeneous items (or items composed of heterogeneous
subcomponents) that share many features in common but have extra features for
particular types, and you don't want to manually train models for every
different type.

R's GBM library, which is what we currently use, handles this kind of data
nicely by making "missing" nodes in the decision tree (a surrogate split) for
features that can have missing values. We'd like to do the same with MLlib,
but LabeledPoint would need to support missing values, and GradientBoostedTrees
would need to be modified to handle them.
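
For a concrete picture of the tree-side change, a minimal sketch (hypothetical
types, not MLlib internals) of a split that routes NaN-encoded missing values
to their own branch, as R's GBM does:

// Three-way routing at a tree node: missing values (encoded as NaN) get their
// own branch instead of being imputed or rejected.
sealed trait Branch
case object GoLeft extends Branch
case object GoRight extends Branch
case object GoMissing extends Branch

def route(featureValue: Double, threshold: Double): Branch =
  if (featureValue.isNaN) GoMissing
  else if (featureValue <= threshold) GoLeft
  else GoRight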






[jira] [Commented] (SPARK-5809) OutOfMemoryError in logDebug in RandomForest.scala

2015-02-17 Thread Devesh Parekh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324862#comment-14324862
 ] 

Devesh Parekh commented on SPARK-5809:
--

This was a naive run of GBM on TF-IDF vectors produced by HashingTF
(https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF),
which creates 2^20 features (more than a million) by default. What is the
maximum number of features that GradientBoostedTrees can handle? I'll do
dimensionality reduction before trying again.
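
For reference, the dimensionality can be capped at featurization time; a small
sketch (the bucket count below is an arbitrary example, not a recommendation):

import org.apache.spark.mllib.feature.HashingTF

// HashingTF defaults to 2^20 buckets; a smaller numFeatures bounds the feature
// space before any trees are grown. 1 << 15 = 32768 is arbitrary.
val tf = new HashingTF(numFeatures = 1 << 15)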

> OutOfMemoryError in logDebug in RandomForest.scala
> --
>
> Key: SPARK-5809
> URL: https://issues.apache.org/jira/browse/SPARK-5809
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Devesh Parekh
>Assignee: Joseph K. Bradley
>Priority: Minor
>  Labels: easyfix
>
> When training a GBM on sparse vectors produced by HashingTF, I get the 
> following OutOfMemoryError, where RandomForest is building a debug string to 
> log.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:3326)
>         at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
>         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
>         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
>         at java.lang.StringBuilder.append(StringBuilder.java:136)
>         at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
>         at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>         at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
>         at scala.collection.AbstractTraversable.addString(Traversable.scala:105)
>         at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
>         at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
>         at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
>         at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
>         at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
>         at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
>         at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
>         at org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67)
>         at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150)
>         at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64)
>         at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
>         at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
>         at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
> A workaround until this is fixed is to modify log4j.properties in the conf 
> directory to filter out debug logs in RandomForest. For example:
> log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN






[jira] [Created] (SPARK-5809) OutOfMemoryError in logDebug in RandomForest.scala

2015-02-13 Thread Devesh Parekh (JIRA)
Devesh Parekh created SPARK-5809:


 Summary: OutOfMemoryError in logDebug in RandomForest.scala
 Key: SPARK-5809
 URL: https://issues.apache.org/jira/browse/SPARK-5809
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Devesh Parekh


When training a GBM on sparse vectors produced by HashingTF, I get the 
following OutOfMemoryError, where RandomForest is building a debug string to 
log.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3326)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
        at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
        at scala.collection.AbstractTraversable.addString(Traversable.scala:105)
        at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
        at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
        at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
        at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
        at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
        at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
        at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
        at org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67)
        at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150)
        at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64)
        at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
        at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
        at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)

A workaround until this is fixed is to modify log4j.properties in the conf 
directory to filter out debug logs in RandomForest. For example:
log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN
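
Longer term, the log site itself can avoid materializing the full string. A
sketch of the pattern, assuming a class that mixes in Spark 1.x's Logging
trait ("candidates" is a hypothetical stand-in for the large collection being
logged):

import org.apache.spark.Logging

class TreeTrainer extends Logging {
  // Build only a bounded preview, and only when debug logging is actually on,
  // instead of mkString-ing a million-element collection into one huge string.
  def logCandidates(candidates: Seq[Int]): Unit = {
    if (log.isDebugEnabled) {
      logDebug(s"sample of candidates: ${candidates.take(20).mkString(", ")} ...")
    }
  }
}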


