[jira] [Comment Edited] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to diagnose and impossible to correct

2021-09-22 Thread Rafael Hernandez Murcia (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418643#comment-17418643 ]

Rafael Hernandez Murcia edited comment on SPARK-34591 at 9/22/21, 2:52 PM:
---

I noticed this behavior of the DecisionTreeClassifier by chance in a use case 
when upgrading from Spark 2.2.3 to 2.4.7.

I do not understand how it is possible that this issue is not considered 
critical.

Regarding unbalanced data sets, both DecisionTreeClassifier and 
RandomForestClassifier are useless for many use cases that we are working on.

I hope that this will be fixed soon, even if it is just by setting the 
default pruning behaviour to False.

I saw the open PR, but I'm worried that we'll have to wait a long time for a 
solution.


> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug - no 2.pdf, 
> Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the behaviour above:
>  * Always occurring (even when the user does not want it to)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen (see the 
> sketch after this list)
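> A minimal illustration of the gap (a sketch, assuming an active SparkSession 
> and an arbitrary labelled training DataFrame named train_df, neither of which 
> comes from this ticket):
> {code}
> from pyspark.ml.classification import DecisionTreeClassifier
> 
> dt = DecisionTreeClassifier(maxDepth=5)
> # No pruning control is exposed on the PySpark estimator:
> print("prune" in [p.name for p in dt.params])  # False
> 
> model = dt.fit(train_df)
> # The fitted model only ever exposes the already-pruned structure:
> print(model.numNodes, model.depth)
> print(model.toDebugString)
> {code}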
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases.
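> As a worked illustration (the numbers are invented for this sketch, not taken 
> from the ticket), two sibling leaves that both predict class 0 get merged, 
> and their distinct probability estimates are pooled away:
> {code}
> # Two sibling leaves, both predicting class 0, so the prune step merges them.
> left_n, left_p1 = 100, 0.10    # P(class 1) = 0.10 -> predicts class 0
> right_n, right_p1 = 100, 0.40  # P(class 1) = 0.40 -> also predicts class 0
> merged_p1 = (left_n * left_p1 + right_n * right_p1) / (left_n + right_n)
> print(merged_p1)  # 0.25 for every row; the 0.10 vs 0.40 distinction is lost
> {code}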
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split even a single node, despite splits being eminently supported by the 
> data. This renders the decision trees and random forests completely unusable.
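> A sketch of the kind of reproduction described above (the data set is 
> invented for illustration and assumes an active SparkSession named spark; on 
> affected versions the fitted tree is expected to collapse to its root):
> {code}
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.linalg import Vectors
> 
> # Unbalanced labels: P(class 1) is 0.1 on one side of the candidate split and
> # 0.4 on the other, but class 0 is the majority on both sides.
> rows = ([(Vectors.dense([0.0]), 0.0)] * 9 + [(Vectors.dense([0.0]), 1.0)] +
>         [(Vectors.dense([1.0]), 0.0)] * 6 + [(Vectors.dense([1.0]), 1.0)] * 4)
> df = spark.createDataFrame(rows, ["features", "label"])
> 
> model = DecisionTreeClassifier(maxDepth=5).fit(df)
> # Both candidate children predict class 0, so the split is pruned away:
> print(model.numNodes)  # 1 on affected versions, despite an informative split
> {code}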
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning.
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult.
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLlib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the Spark detailed 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False (as in Option 1)
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option (see the hypothetical sketch after this list)
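> For illustration only, Option 2 might surface as a new Param on the PySpark 
> estimators. The "prune" keyword below is hypothetical and does not exist in 
> any released PySpark API:
> {code}
> from pyspark.ml.classification import DecisionTreeClassifier
> 
> # Hypothetical API sketch -- "prune" is NOT a real PySpark parameter today.
> dt = DecisionTreeClassifier(maxDepth=5, prune=False)
> model = dt.fit(train_df)  # would keep same-prediction sibling leaves intact
> {code}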


[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-14 Thread Rafael Hernandez Murcia (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512732#comment-16512732 ]

Rafael Hernandez Murcia commented on SPARK-17025:
-

Any news about this? It seems that there's a nice workaround here: 
[https://stackoverflow.com/questions/41399399/serialize-a-custom-transformer-using-python-to-be-used-within-a-pyspark-ml-pipel],
but I wouldn't like to keep it as a permanent solution...
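
For reference, since Spark 2.3 one way to get a persistable pure-Python 
transformer, roughly along the lines of the linked workaround, is the 
DefaultParamsReadable/DefaultParamsWritable mixins. A minimal sketch (the 
UpperCaser class is an invented example, not from this issue):
{code}
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
import pyspark.sql.functions as F

class UpperCaser(Transformer, HasInputCol, HasOutputCol,
                 DefaultParamsReadable, DefaultParamsWritable):
    """Invented example: copies inputCol to outputCol, upper-cased."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(UpperCaser, self).__init__()
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(),
                                  F.upper(F.col(self.getInputCol())))
{code}
With these mixins, save() and load() on the transformer stay on the Python 
side (no _to_java involved), although the class still has to be importable at 
load time.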

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in <module>
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org