[jira] [Commented] (SPARK-45154) Pyspark DecisionTreeClassifier: results and tree structure in spark3 very different from that of the spark2 version on the same data and with the same hyperparameters.

APeng Zhang (Jira) Tue, 12 Dec 2023 04:35:06 -0800


    [ https://issues.apache.org/jira/browse/SPARK-45154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795716#comment-17795716 ]


APeng Zhang commented on SPARK-45154:
-------------------------------------

[~oumarnour] I think you need to set the _seed_ param of CrossValidator.
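
To illustrate why the CrossValidator seed can matter, here is a minimal pure-Python sketch (not Spark code). It assumes folds are assigned by drawing one uniform random number per row, which is a simplification; `assign_folds` and its parameters are hypothetical names made up for this illustration. With a fixed seed the fold membership repeats across runs; without one it changes, so each run trains and validates on different subsets.

```python
import random


def assign_folds(num_rows, num_folds, seed=None):
    """Assign each row index to a fold by drawing one uniform number per row.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    rng = random.Random(seed)  # seed=None starts a fresh, unpredictable stream
    width = 1.0 / num_folds
    # min() guards against floating-point edge cases at the top of the range
    return [min(int(rng.random() / width), num_folds - 1) for _ in range(num_rows)]


# Same seed: the fold assignment is reproduced exactly.
seeded_a = assign_folds(1000, 4, seed=0)
seeded_b = assign_folds(1000, 4, seed=0)

# No seed: each call draws a different assignment.
unseeded_a = assign_folds(1000, 4)
unseeded_b = assign_folds(1000, 4)

print("seeded runs match:  ", seeded_a == seeded_b)
print("unseeded runs match:", unseeded_a == unseeded_b)
```

In PySpark the corresponding setting is the `seed` argument of `CrossValidator`, for example `CrossValidator(..., numFolds=4, seed=0)`. Whether this accounts for the whole Spark 2 versus Spark 3 difference reported in this issue is not established in this thread.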

> Pyspark DecisionTreeClassifier: results and tree structure in spark3 very 
> different from that of the spark2 version on the same data and with the same 
> hyperparameters.
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-45154
>                 URL: https://issues.apache.org/jira/browse/SPARK-45154
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark, Spark Core
>    Affects Versions: 3.0.0, 3.3.1, 3.2.4, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>            Reporter: Oumar Nour
>            Priority: Critical
>              Labels: decisiontree, pyspark3, spark2, spark3
>
> Hello,
> I have an engine running on Spark 2 that trains a DecisionTreeClassifier 
> model with CrossValidator.
>  
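> {code:python}
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
>
> dt = DecisionTreeClassifier(maxBins=10000, seed=0)
> cv_dt_evaluator = BinaryClassificationEvaluator(
>     metricName="",  # value left empty in the original report
>     rawPredictionCol="probability",
> )
> # Create param grid and cross validator for model selection
> dt_grid = (
>     ParamGridBuilder()
>     .addGrid(dt.minInstancesPerNode, [100])
>     .addGrid(dt.maxDepth, [10])
>     .build()
> )
> # Only the estimator is seeded; no seed is passed to CrossValidator.
> cv = CrossValidator(
>     estimator=dt,
>     estimatorParamMaps=dt_grid,
>     evaluator=cv_dt_evaluator,
>     parallelism=4,
>     numFolds=4,
> ){code}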
>  
> I want to {*}migrate from Spark 2 to Spark 3{*}. I ran 
> *DecisionTreeClassifier* on the same data with the same parameter values, but 
> the results are {*}completely different, especially in terms of 
> tree structure{*}: the Spark 3 trees are shallower and have fewer splits. 
> I have read the documentation but have not found an explanation.
>  
> Can you help me find a solution to this problem?
> Thanks in advance for your help 
>         
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
