[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-26 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r53061



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []
+for p, v in paramMap.items():
+jsonParam = {'parent': p.parent, 'name': p.name}
+if (isinstance(v, Estimator) and not (
+isinstance(v, _ValidatorParams) or
+isinstance(v, OneVsRest))
+) or isinstance(v, Transformer) or \

Review comment:
   > It's easy to add the extension when that case is explicitly supported
   
   Seems no.
   Suppose there're some 3rd party libraries depends on spark. And 3rd party 
estimators may include an evaluator param (e.g. for supervising in training 
early stop) . We need to support them.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-24 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530060435



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []
+for p, v in paramMap.items():
+jsonParam = {'parent': p.parent, 'name': p.name}
+if (isinstance(v, Estimator) and not (
+isinstance(v, _ValidatorParams) or
+isinstance(v, OneVsRest))
+) or isinstance(v, Transformer) or \

Review comment:
   > The Validators class will directly take Estimator and Evaluator, and 
the Transformer will be part of the pipeline Estimator. Should the Transformer 
params be part of the pipeline params?
   
   A pyspark param value can be estimator/transformer/evaluator. They're all 
legal.
   Although currently pyspark does not have the case "transformer" to be a 
param value,
   but, allow it here is to provide extensibility.
   
   ```
   if (isinstance(v, Estimator) and not (
   isinstance(v, _ValidatorParams) or
   isinstance(v, OneVsRest))
   ) or isinstance(v, Transformer) or \
   isinstance(Evaluator):
   ```
   The logic try to keep equivalent with jvm side logic 
`isInstanceOf[DefaultParamsWritable] && !isInstanceOf[MLWritable]`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-24 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530056747



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []
+for p, v in paramMap.items():
+jsonParam = {'parent': p.parent, 'name': p.name}
+if (isinstance(v, Estimator) and not (
+isinstance(v, _ValidatorParams) or
+isinstance(v, OneVsRest))
+) or isinstance(v, Transformer) or \

Review comment:
   > It would be clearer if we can create an interface similar to 
DefaultParamsWritable
   
   Pyspark already has this interface, the issue is the existing pyspark 
estimator/transformer did not inherit `DefaultParamsWritable`, change them to 
inherit `DefaultParamsWritable` may cause breaking changes.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-24 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530055236



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []

Review comment:
   `jsonParamMap` ==> `jsonParamList`: sounds good.
   
   `jsonEstimatorParamMaps`, the name `xxxMaps` actually means `xxxMapList`. 
Similar to existing codes.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-24 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530055236



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []

Review comment:
   oh. there's a mistake here. I will update.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-24 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r530055236



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []

Review comment:
   they're maps in "json" style (can be directly serialized by 
json.dumps()) . So named like jsonXXX. It is similar to other existing codes





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529109498



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []
+for p, v in paramMap.items():
+jsonParam = {'parent': p.parent, 'name': p.name}
+if (isinstance(v, Estimator) and not (
+isinstance(v, _ValidatorParams) or
+isinstance(v, OneVsRest))
+) or isinstance(v, Transformer) or \
+isinstance(Evaluator):
+relative_path = f'epm_{p.name}{numParamsNotJson}'
+param_path = os.path.join(path, relative_path)
+numParamsNotJson += 1
+v.save(param_path)
+jsonParam['value'] = relative_path
+jsonParam['isJson'] = False
+elif isinstance(v, MLWritable):
+raise RuntimeError(
+"ValidatorSharedReadWrite.saveImpl does not handle 
parameters of type: "
+"MLWritable that are not 
Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+"it cannot be Validator or OneVsRest")

Review comment:
   In JVM side, the logic is only support `DefaultParamsWritable` instance. 
   Pyspark side we cannot use it, but we can equivalently use:
   Support Estimaor/Evaluator/Transformer type but excludes Validator or 
OneVsRest. These types in JVM side are inherits `DefaultParamsWritable`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529100588



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []
+for p, v in paramMap.items():
+jsonParam = {'parent': p.parent, 'name': p.name}
+if (isinstance(v, Estimator) and not (
+isinstance(v, _ValidatorParams) or
+isinstance(v, OneVsRest))
+) or isinstance(v, Transformer) or \
+isinstance(Evaluator):
+relative_path = f'epm_{p.name}{numParamsNotJson}'
+param_path = os.path.join(path, relative_path)
+numParamsNotJson += 1
+v.save(param_path)
+jsonParam['value'] = relative_path
+jsonParam['isJson'] = False
+elif isinstance(v, MLWritable):
+raise RuntimeError(
+"ValidatorSharedReadWrite.saveImpl does not handle 
parameters of type: "
+"MLWritable that are not 
Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+"it cannot be Validator or OneVsRest")
+else:
+jsonParam['value'] = v
+jsonParam['isJson'] = True
+jsonParamMap.append(jsonParam)
+jsonEstimatorParamMaps.append(jsonParamMap)
+
+skipParams = ['estimator', 'evaluator', 'estimatorParamMaps']
+
+jsonParams = {}
+for p, v in instance._paramMap.items():
+if p.name not in skipParams:
+jsonParams[p.name] = v
+
+jsonParams['estimatorParamMaps'] = jsonEstimatorParamMaps
+
+DefaultParamsWriter.saveMetadata(instance, path, sc, extraMetadata, 
jsonParams)
+evaluatorPath = os.path.join(path, 'evaluator')
+instance.getEvaluator().save(evaluatorPath)
+estimatorPath = os.path.join(path, 'estimator')
+instance.getEstimator().save(estimatorPath)
+
+@staticmethod
+def load(path, sc, metadata):
+evaluatorPath = os.path.join(path, 'evaluator')
+evaluator = DefaultParamsReader.loadParamsInstance(evaluatorPath, sc)
+estimatorPath = os.path.join(path, 'estimator')
+estimator = DefaultParamsReader.loadParamsInstance(estimatorPath, sc)
+
+uidToParams = MetaAlgorithmReadWrite.getUidMap(estimator)

Review comment:
   This map: `MetaAlgorithmReadWrite.getUidMap` used to find the parent of 
each param in `estimatorParamMaps`
   
   e.g.
   CrossValidator on pipeline, and pipeline include [transformer1, 
transformer2, estimator1], and CrossValidator want to tune on params : 
[transformer1.param1, transformer2.param1, estimator1.params1]
   
   https://github.com/apache/spark/pull/30471/files#r529101510
   
   Then the  getUidMap is used for: providing uid, get the corresponding 
transformer / estimator which is the parent of the param.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529101510



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []
+for p, v in paramMap.items():
+jsonParam = {'parent': p.parent, 'name': p.name}

Review comment:
   When we save a param,
   We will first save:
   * parent (uid of estimator/transformer)
   * param name





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529100588



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []
+for p, v in paramMap.items():
+jsonParam = {'parent': p.parent, 'name': p.name}
+if (isinstance(v, Estimator) and not (
+isinstance(v, _ValidatorParams) or
+isinstance(v, OneVsRest))
+) or isinstance(v, Transformer) or \
+isinstance(Evaluator):
+relative_path = f'epm_{p.name}{numParamsNotJson}'
+param_path = os.path.join(path, relative_path)
+numParamsNotJson += 1
+v.save(param_path)
+jsonParam['value'] = relative_path
+jsonParam['isJson'] = False
+elif isinstance(v, MLWritable):
+raise RuntimeError(
+"ValidatorSharedReadWrite.saveImpl does not handle 
parameters of type: "
+"MLWritable that are not 
Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+"it cannot be Validator or OneVsRest")
+else:
+jsonParam['value'] = v
+jsonParam['isJson'] = True
+jsonParamMap.append(jsonParam)
+jsonEstimatorParamMaps.append(jsonParamMap)
+
+skipParams = ['estimator', 'evaluator', 'estimatorParamMaps']
+
+jsonParams = {}
+for p, v in instance._paramMap.items():
+if p.name not in skipParams:
+jsonParams[p.name] = v
+
+jsonParams['estimatorParamMaps'] = jsonEstimatorParamMaps
+
+DefaultParamsWriter.saveMetadata(instance, path, sc, extraMetadata, 
jsonParams)
+evaluatorPath = os.path.join(path, 'evaluator')
+instance.getEvaluator().save(evaluatorPath)
+estimatorPath = os.path.join(path, 'estimator')
+instance.getEstimator().save(estimatorPath)
+
+@staticmethod
+def load(path, sc, metadata):
+evaluatorPath = os.path.join(path, 'evaluator')
+evaluator = DefaultParamsReader.loadParamsInstance(evaluatorPath, sc)
+estimatorPath = os.path.join(path, 'estimator')
+estimator = DefaultParamsReader.loadParamsInstance(estimatorPath, sc)
+
+uidToParams = MetaAlgorithmReadWrite.getUidMap(estimator)

Review comment:
   This map: `MetaAlgorithmReadWrite.getUidMap` used to find the parent of 
each param in `estimatorParamMaps`
   
   e.g.
   CrossValidator on pipeline, and pipeline include [transformer1, 
transformer2, estimator1], and CrossValidator want to tune on params : 
[transformer1.param1, transformer2.param1, estimator1.params1]
   
   Then the  getUidMap is used for: providing uid, get the corresponding 
transformer / estimator which is the parent of the param.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529100588



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []
+for p, v in paramMap.items():
+jsonParam = {'parent': p.parent, 'name': p.name}
+if (isinstance(v, Estimator) and not (
+isinstance(v, _ValidatorParams) or
+isinstance(v, OneVsRest))
+) or isinstance(v, Transformer) or \
+isinstance(Evaluator):
+relative_path = f'epm_{p.name}{numParamsNotJson}'
+param_path = os.path.join(path, relative_path)
+numParamsNotJson += 1
+v.save(param_path)
+jsonParam['value'] = relative_path
+jsonParam['isJson'] = False
+elif isinstance(v, MLWritable):
+raise RuntimeError(
+"ValidatorSharedReadWrite.saveImpl does not handle 
parameters of type: "
+"MLWritable that are not 
Estimaor/Evaluator/Transformer, and if parameter is estimator,"
+"it cannot be Validator or OneVsRest")
+else:
+jsonParam['value'] = v
+jsonParam['isJson'] = True
+jsonParamMap.append(jsonParam)
+jsonEstimatorParamMaps.append(jsonParamMap)
+
+skipParams = ['estimator', 'evaluator', 'estimatorParamMaps']
+
+jsonParams = {}
+for p, v in instance._paramMap.items():
+if p.name not in skipParams:
+jsonParams[p.name] = v
+
+jsonParams['estimatorParamMaps'] = jsonEstimatorParamMaps
+
+DefaultParamsWriter.saveMetadata(instance, path, sc, extraMetadata, 
jsonParams)
+evaluatorPath = os.path.join(path, 'evaluator')
+instance.getEvaluator().save(evaluatorPath)
+estimatorPath = os.path.join(path, 'estimator')
+instance.getEstimator().save(estimatorPath)
+
+@staticmethod
+def load(path, sc, metadata):
+evaluatorPath = os.path.join(path, 'evaluator')
+evaluator = DefaultParamsReader.loadParamsInstance(evaluatorPath, sc)
+estimatorPath = os.path.join(path, 'estimator')
+estimator = DefaultParamsReader.loadParamsInstance(estimatorPath, sc)
+
+uidToParams = MetaAlgorithmReadWrite.getUidMap(estimator)

Review comment:
   This map: `MetaAlgorithmReadWrite.getUidMap` used to find the parent of 
each param in `estimatorParamMaps`
   
   e.g.
   CrossValidator on pipeline, and pipeline include [transformer1, 
transformer2, estimator1], and CrossValidator want to tune on params : 
[transformer1.params1, transformer2.params1, estimator1]
   
   Then the  getUidMap is used for: providing uid, get the corresponding 
transformer / estimator which is the parent of the param.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a change in pull request #30471: [WIP][SPARK-33520][ML] make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread GitBox


WeichenXu123 commented on a change in pull request #30471:
URL: https://github.com/apache/spark/pull/30471#discussion_r529099004



##
File path: python/pyspark/ml/tuning.py
##
@@ -207,6 +210,205 @@ def _to_java_impl(self):
 return java_estimator, java_epms, java_evaluator
 
 
+class _ValidatorSharedReadWrite:
+
+@staticmethod
+def saveImpl(path, instance, sc, extraMetadata=None):
+from pyspark.ml.classification import OneVsRest
+numParamsNotJson = 0
+jsonEstimatorParamMaps = []
+for paramMap in instance.getEstimatorParamMaps():
+jsonParamMap = []
+for p, v in paramMap.items():
+jsonParam = {'parent': p.parent, 'name': p.name}
+if (isinstance(v, Estimator) and not (
+isinstance(v, _ValidatorParams) or
+isinstance(v, OneVsRest))
+) or isinstance(v, Transformer) or \
+isinstance(Evaluator):
+relative_path = f'epm_{p.name}{numParamsNotJson}'
+param_path = os.path.join(path, relative_path)
+numParamsNotJson += 1
+v.save(param_path)
+jsonParam['value'] = relative_path
+jsonParam['isJson'] = False

Review comment:
   For the case:
   CrossValidator on estimator of OneVsRest, and need to turning the parameter 
`OneVsRest.classifier`, then  EstimatorParamMaps will include a param 
(OneVsRest.classifier -> A_Classifier)
   We need to consider how to saving Classifier type param value.
   This is what the code do:
* if it is normal param, we save it as Json format and `isJson` field is 
True.
* otherwise set `isJson` field False, and save it by object.save().





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org