[ https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-31497.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 28279
[https://github.com/apache/spark/pull/28279]

> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31497
>                 URL: https://issues.apache.org/jira/browse/SPARK-31497
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.4.5
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>            Priority: Major
>             Fix For: 3.0.0
>
>
> A PySpark CrossValidator/TrainValidationSplit whose estimator is a Pipeline cannot save or load its model.
> Reproduction code, run in the pyspark shell:
> 1) Train a model and save it in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
>
> # `spark` is the SparkSession that the pyspark shell provides.
> training = spark.createDataFrame([
>     (0, "a b c d e spark", 1.0),
>     (1, "b d", 0.0),
>     (2, "spark f g h", 1.0),
>     (3, "hadoop mapreduce", 0.0),
>     (4, "b spark who", 1.0),
>     (5, "g d a y", 0.0),
>     (6, "spark fly", 1.0),
>     (7, "was mapreduce", 0.0),
>     (8, "e spark program", 1.0),
>     (9, "a e c l", 0.0),
>     (10, "spark compile", 1.0),
>     (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
>
> # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
>
> paramGrid = ParamGridBuilder() \
>     .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
>     .addGrid(lr.regParam, [0.1, 0.01]) \
>     .build()
>
> crossval = CrossValidator(estimator=pipeline,
>                           estimatorParamMaps=paramGrid,
>                           evaluator=BinaryClassificationEvaluator(),
>                           numFolds=2)  # use 3+ folds in practice
>
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001')  # saving the model fails and raises an error
> {code}
> 2) Train a cross-validation model in Scala with code similar to the above, save it to
> '/tmp/model_cv_scala001', then run the following code in pyspark:
> {code:python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001')  # raises an error
> {code}
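
The reproduction steps can also be used to check the fix. Below is a minimal sketch, not part of the pull request. Both helper names (has_spark_31497_fix, roundtrip_cv_model) are hypothetical. The version cutoff is an assumption taken only from the "Fix For: 3.0.0" field of this issue. The stage-count comparison is an illustrative sanity check, not an equality test of the models.

```python
import re


def has_spark_31497_fix(version: str) -> bool:
    """Return True if `version` is 3.0 or later.

    Assumption: the cutoff comes only from the "Fix For: 3.0.0" field
    of SPARK-31497; this sketch does not account for any backports.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Spark version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (3, 0)


def roundtrip_cv_model(cv_model, path):
    """Save a fitted CrossValidatorModel to `path` and load it back.

    Hypothetical helper. `path` must not already exist, because a plain
    save() does not overwrite.
    """
    # Imported lazily so this module can be imported without Spark installed.
    from pyspark.ml.tuning import CrossValidatorModel

    cv_model.save(path)  # the call that raised an error in 2.4.5
    loaded = CrossValidatorModel.load(path)

    # Sanity check: the best model is a PipelineModel in both cases,
    # so the number of stages should survive the round trip.
    assert len(loaded.bestModel.stages) == len(cv_model.bestModel.stages)
    return loaded
```

In the pyspark shell, after fitting cvModel as in step 1, one could call roundtrip_cv_model(cvModel, '/tmp/cv_model001') when has_spark_31497_fix(spark.version) is true.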