[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20499

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166320786

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```

Yea, either way works for me. Let me look around this a bit more and give it a shot to show how it looks.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166310719

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```

I prefer `def replace(self, to_replace, value=_NoValue, subset=None)`. `def replace(self, to_replace, *args, **kwargs)` loses the information about `value` and `subset`.
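The sentinel-default approach cloud-fan prefers can be sketched as a standalone function. This is a hypothetical illustration modeled on NumPy's `numpy._globals._NoValue`, not the actual Spark implementation; the names `_NoValueType` and the simplified `replace` are assumptions for the sketch:

```python
class _NoValueType:
    """Singleton sentinel meaning 'the caller did not supply this argument'."""
    def __repr__(self):
        return "<no value>"

_NoValue = _NoValueType()

def replace(to_replace, value=_NoValue, subset=None):
    # Unlike a None default, the sentinel distinguishes "omitted" from an
    # explicit replace(..., None), which is a legal replacement value.
    if value is _NoValue:
        if isinstance(to_replace, dict):
            value = None  # ignored when to_replace is a mapping
        else:
            raise TypeError("value is required when to_replace is not a dictionary.")
    return to_replace, value, subset
```

With this, `replace({'Alice': 'Bob'})` succeeds, `replace('Alice', None)` still works, and `replace('Alice')` raises, while the signature shown by pydoc keeps the named `value` and `subset` parameters.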
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166302987

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```

It's just as is:

```python
def replace(self, to_replace, *args, **kwargs)
```

but this is better than `replace(self, to_replace, value={}, subset=None)` IMHO.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166302599

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```

what's the docstring for `def replace(self, to_replace, *args, **kwargs)`?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166301172

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```

So, to cut it short: yea, if a less pretty doc is fine, I can try. That would reduce the change a lot.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166300631

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```

Yea, I think that summarises the issue.

> Can we use an invalid value as the default value for value? Then we can throw an exception if the value is not set by the user.

Yea, we could define a class / instance to indicate no value like NumPy does - https://github.com/numpy/numpy/blob/master/numpy/_globals.py#L76 . I was thinking of resembling this too, but it is kind of a new approach for Spark, and this is a single case so far.

To get to the point: yea, we could maybe use an invalid value and unset it if `to_replace` is a dictionary. For example, I can assign `{}`. But then the problem is the docstring produced by pydoc and the API documentation. It will show something like:

```
Help on method replace in module pyspark.sql.dataframe:

replace(self, to_replace, value={}, subset=None) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` replacing a value with another value.
    ...
```

This is pretty confusing. To my knowledge, we can't really override this signature - I tried a few times before and failed, if I remember correctly. Maybe this is good enough, but I didn't want to start with that approach, because the issue @rxin raised is, more strictly, that `value` has a default value at all. To be honest, it seems pandas's `replace` also has `None` for the default value - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html#pandas.DataFrame.replace.
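The pydoc rendering described above follows from how Python exposes default values through introspection; a minimal reproduction outside Spark (the `replace` function here is a throwaway stand-in, not the real method):

```python
import inspect

def replace(to_replace, value={}, subset=None):
    """Returns a new DataFrame replacing a value with another value."""

# The default literally appears in the introspected signature,
# which is what pydoc renders in the help text:
print(inspect.signature(replace))  # (to_replace, value={}, subset=None)
```

This is why choosing `{}` as a "no value" marker leaks straight into the documented API, whereas a sentinel object with a readable `__repr__` at least renders as something self-describing.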
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166290471

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```

I see the problem now. If `to_replace` is a dict, then `value` should be ignored and we should provide a default value. If `to_replace` is not a dict, then `value` is required and we should not provide a default value.

Can we use an invalid value as the default value for `value`? Then we can throw an exception if the `value` is not set by the user.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166286067

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```

Yes, if `value` is explicitly given, I thought we'd ignore `value`, as we did from the first place.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166278935

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```

what's the expectation? if `to_replace` is a dict, `value` should be ignored?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166220816

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1545,8 +1545,8 @@ def replace(self, to_replace, value=None, subset=None):
         :param to_replace: bool, int, long, float, string, list or dict.
             Value to be replaced.
-            If the value is a dict, then `value` is ignored and `to_replace` must be a
-            mapping between a value and a replacement.
+            If the value is a dict, then `value` is ignored or can be omitted, and `to_replace`
```

now there is no parameter named `value`.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166217503

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1587,6 +1597,51 @@ def replace(self, to_replace, value=None, subset=None):
         |null|  null|null|
         +----+------+----+
         """
+
+        # It deals with a problem when 'value' is set to None and 'to_replace' is a dictionary.
+        # Validate if arguments are missing or not.
+        is_more_than_two = len(args) + len(kwargs) > 2
```

I read this a few times and still feel that this is kind of verbose. But it seems there is no better way in Python to check whether an optional parameter was set or not.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r166216324

Diff: python/pyspark/sql/tests.py

```diff
@@ -2175,7 +2175,7 @@ def test_replace(self):
         # replace with subset specified by a string of a column name w/ actual change
         row = self.spark.createDataFrame(
-            [(u'Alice', 10, 80.1)], schema).replace(10, 20, subset='age').first()
+            [(u'Alice', 10, 80.1)], schema).replace(10, 'age', value=20).first()
```

Will this conflict with Python's conventions for function arguments? Usually, the arguments before a keyword argument are resolved by position, but here `'age'` is resolved to `subset`, which is the third parameter, behind `value`. Since the function signature is changed, this may not be a big issue.
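The binding viirya describes follows from the `*args` signature: once `value` arrives as a keyword, a lone positional argument must be something else. A simplified dispatcher (a hypothetical sketch, not the PR's exact code) makes the effect concrete:

```python
def replace(to_replace, *args, **kwargs):
    # Simplified dispatch: with a *args signature, a positional argument
    # after 'to_replace' is reinterpreted once 'value' is passed by keyword.
    if "value" in kwargs:
        value = kwargs["value"]
        subset = args[0] if args else kwargs.get("subset")
    else:
        value = args[0] if args else None
        subset = args[1] if len(args) > 1 else kwargs.get("subset")
    return value, subset

# The second positional argument is no longer necessarily 'value':
assert replace(10, 'age', value=20) == (20, 'age')   # 'age' binds to subset
assert replace(10, 20, 'age') == (20, 'age')         # conventional binding
```

This is exactly the deviation from ordinary positional binding that the review comment flags: both calls mean the same thing, but only the second follows the usual convention.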
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165951235

Diff: python/pyspark/sql/tests.py

```diff
@@ -2186,7 +2186,7 @@ def test_replace(self):
         # replace with subset specified with one column replaced, another column not in subset
         # stays unchanged.
         row = self.spark.createDataFrame(
-            [(u'Alice', 10, 10.0)], schema).replace(10, 20, subset=['name', 'age']).first()
+            [(u'Alice', 10, 10.0)], schema).replace(10, value=20, subset=['name', 'age']).first()
```

I don't think it's necessary, but let me keep them, since they at least test different combinations of valid cases.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165950192

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1557,6 +1557,9 @@ def replace(self, to_replace, value=None, subset=None):
         For example, if `value` is a string, and subset contains a non-string column,
         then the non-string column is simply ignored.

+        .. note:: `value` can only be omitted when `to_replace` is a dictionary. Otherwise,
+            it is required.
```

Sure.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165883566

Diff: python/pyspark/sql/tests.py

```diff
@@ -2186,7 +2186,7 @@ def test_replace(self):
         # replace with subset specified with one column replaced, another column not in subset
         # stays unchanged.
         row = self.spark.createDataFrame(
-            [(u'Alice', 10, 10.0)], schema).replace(10, 20, subset=['name', 'age']).first()
+            [(u'Alice', 10, 10.0)], schema).replace(10, value=20, subset=['name', 'age']).first()
```

Are the above two test changes necessary?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165884365

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1587,6 +1600,52 @@ def replace(self, to_replace, value=None, subset=None):
         |null|  null|null|
         +----+------+----+
         """
+
+        # It deals with a problem when 'value' is set to None and 'to_replace' is a dictionary.
+        is_more_than_two = len(args) + len(kwargs) > 2
+        if is_more_than_two:
+            raise TypeError(
+                "Arguments are expected to be to_replace, value and subset; however, got "
+                "[%s]" % ", ".join(map(str, [to_replace] + list(args) + list(kwargs.values()))))
+
+        is_unexpected_kwargs = \
+            len(kwargs) == 2 and ("value" not in kwargs or "subset" not in kwargs)
```

```python
df.na.replace({'Alice': 'Bob'}, foo='bar').show()
```

Seems this case can't be detected?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165883051

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1557,6 +1557,9 @@ def replace(self, to_replace, value=None, subset=None):
         For example, if `value` is a string, and subset contains a non-string column,
         then the non-string column is simply ignored.

+        .. note:: `value` can only be omitted when `to_replace` is a dictionary. Otherwise,
+            it is required.
```

Shall we just describe this in `value`'s param doc?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165834947

Diff: python/pyspark/sql/dataframe.py

```diff
@@ -1587,6 +1600,51 @@ def replace(self, to_replace, value=None, subset=None):
         |null|  null|null|
         +----+------+----+
         """
+
+        # It deals with a problem when 'value' is set to None and 'to_replace' is a dictionary.
+        is_more_than_two = len(args) + len(kwargs) > 2
+        if is_more_than_two:
+            raise TypeError(
+                "Arguments are expected to be to_replace, value and subset; however, got "
+                "[%s]" % ", ".join(map(str, [to_replace] + list(args) + list(kwargs.values()))))
+
+        is_unexpected_kwargs = \
+            len(kwargs) == 2 and ("value" not in kwargs or "subset" not in kwargs)
+        if is_unexpected_kwargs:
+            raise TypeError(
+                "Arguments are expected to be to_replace, value and subset; however, got "
+                "unexpected keyword arguments: [%s]" % ", ".join(
+                    filter(lambda a: a not in ["value", "subset"], kwargs.keys())))
+
+        if "value" in kwargs and "subset" in kwargs:
+            # replace(..., value=..., subset=...)
+            # replace(..., subset=..., value=...)
+            value = kwargs["value"]
+            subset = kwargs["subset"]
+        elif "value" in kwargs:
+            # replace(..., ..., value=...)
+            value = kwargs["value"]
+            subset = args[0] if len(args) == 1 else None
+        elif "subset" in kwargs:
+            # replace(..., ..., subset=...)
+            if len(args) == 1:
+                value = args[0]
+            elif isinstance(to_replace, dict):
+                value = None  # When to_replace is a dictionary, value can be omitted.
+            else:
+                raise TypeError("value is required when to_replace is not a dictionary.")
```

Hm, I think that check is still valid. The newly added logic here focuses on checking for missing arguments, whereas the logic below focuses on checking whether the arguments are valid. I will try to add an explicit test for the https://github.com/apache/spark/pull/20499#discussion_r165828448 case, with a few comment changes. For https://github.com/apache/spark/pull/20499#discussion_r165828519, I just tried to check.

Seems we should keep that `None` to support:

```python
>>> df.na.replace('Alice', None).show()
```

```
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
```

Otherwise it fails with:

```
...
ValueError: If to_replace is not a dict, value should be a bool, float, int, long, string, list, tuple or None. Got
```
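The argument resolution discussed above can be distilled into a runnable sketch. This is a standalone approximation without the DataFrame plumbing; the purely positional branch is filled in as an assumption, since the quoted hunk does not show it:

```python
def resolve_replace_args(to_replace, *args, **kwargs):
    """Return (value, subset), raising TypeError on bad argument combinations."""
    if len(args) + len(kwargs) > 2:
        raise TypeError("Arguments are expected to be to_replace, value and subset")
    if len(kwargs) == 2 and ("value" not in kwargs or "subset" not in kwargs):
        raise TypeError("unexpected keyword arguments: [%s]" % ", ".join(
            k for k in kwargs if k not in ("value", "subset")))
    if "value" in kwargs and "subset" in kwargs:
        value, subset = kwargs["value"], kwargs["subset"]
    elif "value" in kwargs:
        value = kwargs["value"]
        subset = args[0] if len(args) == 1 else None
    elif "subset" in kwargs:
        subset = kwargs["subset"]
        if len(args) == 1:
            value = args[0]
        elif isinstance(to_replace, dict):
            value = None  # value may be omitted when to_replace is a dict
        else:
            raise TypeError("value is required when to_replace is not a dictionary.")
    else:
        # Purely positional: replace(to_replace[, value[, subset]]).
        if len(args) >= 1:
            value = args[0]
        elif isinstance(to_replace, dict):
            value = None
        else:
            raise TypeError("value is required when to_replace is not a dictionary.")
        subset = args[1] if len(args) == 2 else None
    return value, subset
```

Note how `resolve_replace_args({'a': 'b'})` succeeds with `value=None` while `resolve_replace_args('Alice')` raises, which is exactly the distinction the `None` default could not express.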
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165828519

Diff: python/pyspark/sql/dataframe.py (the same argument-validation hunk quoted above), on the line:

```python
raise TypeError("value is required when to_replace is not a dictionary.")
```

Btw, can't we just remove `value is not None` from the existing check, so that `None` is disallowed when `to_replace` is not a dict?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20499#discussion_r165828448

Diff: python/pyspark/sql/dataframe.py (the same argument-validation hunk quoted above), on the line:

```python
raise TypeError("value is required when to_replace is not a dictionary.")
```

There are some old checks below, like:

```python
if not isinstance(value, valid_types) and value is not None \
        and not isinstance(to_replace, dict):
    raise ValueError("If to_replace is not a dict, value should be "
                     "a bool, float, int, long, string, list, tuple or None. "
                     "Got {0}".format(type(value)))
```

Should we clean it up too?