[GitHub] [spark] zero323 commented on a change in pull request #29122: [SPARK-32320][PYSPARK] Remove mutable default arguments
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r528698769

## File path: python/pyspark/sql/functions.py

```diff
@@ -3471,7 +3473,7 @@ def from_json(col, schema, options={}):
     return Column(jc)


-def to_json(col, options={}):
+def to_json(col, options=None):
```

Review comment:
Seems like we still have to modify a few annotations, right? Probably something like [functions.pyi.patch.txt](https://github.com/apache/spark/files/5583207/functions.pyi.patch.txt)

---
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
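The pattern under review ‒ replacing `options={}` with `options=None` ‒ exists because Python evaluates default argument values once, at function definition time, so a mutable default is shared across all calls. A minimal standalone sketch of the pitfall and the `None`-sentinel fix (illustrative names, not Spark code):

```python
def append_bad(item, items=[]):
    # The list is created once when the function is defined,
    # so every call without an explicit argument mutates the same object.
    items.append(item)
    return items


def append_good(item, items=None):
    # None is an immutable sentinel; a fresh list is created per call.
    if items is None:
        items = []
    items.append(item)
    return items


print(append_bad(1))   # [1]
print(append_bad(2))   # [1, 2]  <- state leaked between calls
print(append_good(1))  # [1]
print(append_good(2))  # [2]
```

The `if options is None: options = {}` guard is the conventional replacement, and it is the rewrite this PR applies throughout PySpark.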
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r508870060

## File path: python/pyspark/sql/avro/functions.py

```diff
@@ -26,7 +26,7 @@


 @since(3.0)
-def from_avro(data, jsonFormatSchema, options={}):
+def from_avro(data, jsonFormatSchema, options=None):
```

Review comment:
FYI I am trying to establish an optimal set of flags for the mypy config, but it is a bit slowish work ‒ unlike standalone stubs, we have to correct for internal modules, tests and such, and I'd prefer to avoid atomic options (ignoring all errors there).
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r508868556

## File path: python/pyspark/sql/avro/functions.py

```diff
@@ -26,7 +26,7 @@


 @since(3.0)
-def from_avro(data, jsonFormatSchema, options={}):
+def from_avro(data, jsonFormatSchema, options=None):
```

Review comment:
> Mypy has been added, and I would expect that this would be caught.

As far as I recall this behavior depends on the MyPy flags. In particular see [Disabling strict optional checking](https://mypy.readthedocs.io/en/stable/kinds_of_types.html#no-strict-optional). So adding `no_implicit_optional` to `mypy.ini` should catch this.
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r508864627

## File path: python/pyspark/ml/regression.py

```diff
@@ -1654,7 +1656,7 @@ class _AFTSurvivalRegressionParams(_PredictorParams, HasMaxIter, HasTol, HasFitI
     def __init__(self, *args):
         super(_AFTSurvivalRegressionParams, self).__init__(*args)
         self._setDefault(censorCol="censor",
-                         quantileProbabilities=[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99],
+                         quantileProbabilities=DEFAULT_QUANTILE_PROBABILITIES,
```

Review comment:
Sorry for the confusion ‒ I just wanted to point it out to see if I missed something, but I don't think we should change it here, as this will be a breaking change.
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r508831474

## File path: python/pyspark/ml/regression.py

```diff
@@ -1654,7 +1656,7 @@ class _AFTSurvivalRegressionParams(_PredictorParams, HasMaxIter, HasTol, HasFitI
     def __init__(self, *args):
         super(_AFTSurvivalRegressionParams, self).__init__(*args)
         self._setDefault(censorCol="censor",
-                         quantileProbabilities=[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99],
+                         quantileProbabilities=DEFAULT_QUANTILE_PROBABILITIES,
```

Review comment:
Side note ‒ I remember this was included for Scala parity (SPARK-32310), but isn't that dead code, given that we use the same default in the estimator and model constructors?
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r508828995

## File path: python/pyspark/ml/regression.py

```diff
@@ -1740,7 +1742,7 @@ class AFTSurvivalRegression(_JavaRegressor, _AFTSurvivalRegressionParams,
     @keyword_only
     def __init__(self, *, featuresCol="features", labelCol="label", predictionCol="prediction",
                  fitIntercept=True, maxIter=100, tol=1E-6, censorCol="censor",
-                 quantileProbabilities=list([0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]),
+                 quantileProbabilities=DEFAULT_QUANTILE_PROBABILITIES,
```

Review comment:
This is still a mutable default, isn't it? Shouldn't `DEFAULT_QUANTILE_PROBABILITIES` be a `tuple`, if we want to be consistent?
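A tuple, as suggested, removes the hazard: the shared default is still evaluated once, but it cannot be mutated. A small sketch of the idea (names are illustrative, not the actual Spark constants):

```python
# A tuple default is created once at definition time, like any default,
# but since tuples are immutable, sharing it across calls is safe.
DEFAULT_QUANTILE_PROBABILITIES = (0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99)


def fit_sketch(quantile_probabilities=DEFAULT_QUANTILE_PROBABILITIES):
    # Convert to a list only where a caller needs a mutable copy;
    # each call gets its own list.
    return list(quantile_probabilities)


print(fit_sketch()[:3])  # [0.01, 0.05, 0.1]
```

Each call returns a fresh list, so mutating one result cannot affect later calls.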
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r508825660

## File path: python/pyspark/ml/tuning.py

```diff
@@ -868,12 +868,12 @@ class TrainValidationSplitModel(Model, _TrainValidationSplitParams, MLReadable,

     .. versionadded:: 2.0.0
     """

-    def __init__(self, bestModel, validationMetrics=[], subModels=None):
+    def __init__(self, bestModel, validationMetrics=None, subModels=None):
```

Review comment:
Annotation https://github.com/apache/spark/blob/46ad325e56abd95c0ffdbe64aad78582da8c725d/python/pyspark/ml/tuning.pyi#L174 should change to `validationMetrics: Optional[List[float]] = ...,`
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r508825239

## File path: python/pyspark/ml/tuning.py

```diff
@@ -508,13 +508,13 @@ class CrossValidatorModel(Model, _CrossValidatorParams, MLReadable, MLWritable):

     .. versionadded:: 1.4.0
     """

-    def __init__(self, bestModel, avgMetrics=[], subModels=None):
+    def __init__(self, bestModel, avgMetrics=None, subModels=None):
```

Review comment:
Annotation https://github.com/apache/spark/blob/46ad325e56abd95c0ffdbe64aad78582da8c725d/python/pyspark/ml/tuning.pyi#L107 should change to `avgMetrics: Optional[List[float]] = ...,`
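The runtime counterpart of the suggested `Optional[List[float]]` stub change is the usual `None` guard in `__init__`. A hypothetical sketch of the pattern (not the actual PySpark class):

```python
from typing import List, Optional


class CrossValidatorModelSketch:
    # Sketch only: the stub declares Optional[List[float]] while the
    # runtime replaces None with a fresh list per instance, so no list
    # object is shared between instances.
    def __init__(self, best_model,
                 avg_metrics: Optional[List[float]] = None,
                 sub_models=None):
        self.best_model = best_model
        self.avg_metrics = [] if avg_metrics is None else avg_metrics
        self.sub_models = sub_models


m1 = CrossValidatorModelSketch("model-a")
m2 = CrossValidatorModelSketch("model-b")
m1.avg_metrics.append(0.9)
print(m2.avg_metrics)  # [] -- no state shared between instances
```

With the old `avgMetrics=[]` default, both instances would have shared the same list, and `m2.avg_metrics` would unexpectedly contain `0.9`.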
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r508823684

## File path: python/pyspark/sql/functions.py

```diff
@@ -2476,7 +2478,7 @@ def json_tuple(col, *fields):


 @since(2.1)
-def from_json(col, schema, options={}):
+def from_json(col, schema, options=None):
```

Review comment:
Same as for `sql.avro.functions`, we'll need `Optional[Dict[str, str]]` here (applies also to the remaining functions).
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r508823033

## File path: python/pyspark/sql/avro/functions.py

```diff
@@ -26,7 +26,7 @@


 @since(3.0)
-def from_avro(data, jsonFormatSchema, options={}):
+def from_avro(data, jsonFormatSchema, options=None):
```

Review comment:
We should change annotations from

```python
def from_avro(
    data: ColumnOrName, jsonFormatSchema: str, options: Dict[str, str] = ...
) -> Column: ...
```

to

```python
def from_avro(
    data: ColumnOrName, jsonFormatSchema: str, options: Optional[Dict[str, str]] = ...
) -> Column: ...
```
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r458291068

## File path: python/pyspark/sql/functions.py

```diff
@@ -2392,7 +2393,7 @@ def json_tuple(col, *fields):


 @since(2.1)
-def from_json(col, schema, options={}):
+def from_json(col, schema, options: Dict = None):
```

Review comment:
How about continuing the discussion [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-tp26232p29866.html) for better visibility?
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r458287046

## File path: python/pyspark/sql/functions.py

```diff
@@ -2392,7 +2393,7 @@ def json_tuple(col, *fields):


 @since(2.1)
-def from_json(col, schema, options={}):
+def from_json(col, schema, options: Dict = None):
```

Review comment:
> Personally, I don't see any harm in adding the types over time, and work towards coverage,

The biggest problem is the resolution (https://bugs.python.org/issue39168 seems to be fixed now). As far as I am aware it is all-or-nothing at the module level (assuming `partial` mode), so working over time will likely degrade the experience of users that already use annotations.

And it seems somewhat unnecessary, considering that you could migrate stubs into the main repo with a single copy, if there was a will to maintain the result. The inline variant would be trickier (and there is still the concern of https://bugs.python.org/issue39168).
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r456756157

## File path: python/pyspark/sql/functions.py

```diff
@@ -2392,7 +2393,7 @@ def json_tuple(col, *fields):


 @since(2.1)
-def from_json(col, schema, options={}):
+def from_json(col, schema, options: Dict = None):
```

Review comment:
Pre-3.6 wasn't an issue anyway. In practice, given the size of the project and the overall complexity of the annotations, stubs seem to be a somewhat better choice anyway (avoiding cyclic imports or certain `Generic` pitfalls for starters, and not being stuck in legacy Python support). This means that a potential migration might be relatively easy.

My biggest concern is that over four years I haven't seen any active Spark contributors willing to help with maintenance (and that's despite gentle pings). Additionally, the annotation ecosystem is still evolving ‒ so having a project that can iterate faster than the annotated codebase is not a bad thing.

The biggest advantage of maintaining annotations alongside the actual codebase is that it requires conscious choices of signatures ‒ not every type of signature can be annotated in a meaningful way, and some can result in exploding overloads. But it doesn't really remove the maintenance overhead ‒ for example, changes in the ML API, mostly invisible to the final user, wreaked havoc, as it is the most vulnerable part of the code (a lot of generics there).

Anyway... That's just my 2 cents. If you reanimate the discussion, please ping me ‒ I'd like to see how it goes. TIA
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r456040994

## File path: python/pyspark/ml/tuning.py

```diff
@@ -500,13 +501,13 @@ class CrossValidatorModel(Model, _CrossValidatorParams, MLReadable, MLWritable):

     .. versionadded:: 1.4.0
     """

-    def __init__(self, bestModel, avgMetrics=[], subModels=None):
+    def __init__(self, bestModel, avgMetrics: List = None, subModels=None):
```

Review comment:
Out of curiosity ‒ why do we need a default here?
zero323 commented on a change in pull request #29122:
URL: https://github.com/apache/spark/pull/29122#discussion_r456033843

## File path: python/pyspark/sql/functions.py

```diff
@@ -2392,7 +2393,7 @@ def json_tuple(col, *fields):


 @since(2.1)
-def from_json(col, schema, options={}):
+def from_json(col, schema, options: Dict = None):
```

Review comment:
> Furthermore, it is easier to keep the code and annotations in sync. Curious what Maciej thinks.

That's a topic for a long discussion :) Overall, I made some points [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html) and [here](http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html) ‒ I think these are still valid.

Now... I am still open to merging `pyspark-stubs` with the main project. However, having some annotations in the codebase without the will to maintain them (and maintaining useful annotations is honestly a pain) and bring them to high coverage any time soon may make them either a wasted effort (if not used at all, or if the package is not PEP 561 compliant) or maybe even harmful. That's a topic for a longer discussion though... In general I'd bring the discussion to dev and/or JIRA first.

On a side note ‒ having an incomplete annotation for a signature is not very useful in practice, especially when you miss return type annotations. Additionally, these should be `Optional`, and in most cases the key type could be annotated.