[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20499


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166320786
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
 return DataFrame(self._jdf.na().fill(value, 
self._jseq(subset)), self.sql_ctx)
 
 @since(1.4)
-def replace(self, to_replace, value=None, subset=None):
+def replace(self, to_replace, *args, **kwargs):
--- End diff --

Yea, either way works for me. Let me look around this a bit more and give it a 
shot to show how it looks.


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166310719
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
 return DataFrame(self._jdf.na().fill(value, 
self._jseq(subset)), self.sql_ctx)
 
 @since(1.4)
-def replace(self, to_replace, value=None, subset=None):
+def replace(self, to_replace, *args, **kwargs):
--- End diff --

I prefer `def replace(self, to_replace, value=_NoValue, subset=None)`.

`def replace(self, to_replace, *args, **kwargs)` loses the information 
about `value` and `subset`.
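As an editorial illustration of this suggestion, here is a minimal sketch of the sentinel-default pattern (the `_NoValue` name follows NumPy's convention; this standalone function is hypothetical, not the PR's actual code):

```python
class _NoValueType(object):
    """Sentinel marking 'argument was not supplied by the caller'."""
    def __repr__(self):
        return "<no value>"


# Module-level singleton used only as a default; never passed by users.
_NoValue = _NoValueType()


def replace(to_replace, value=_NoValue, subset=None):
    # 'value' may only be omitted when 'to_replace' is a dict; the
    # sentinel default distinguishes "omitted" from "passed None".
    if value is _NoValue:
        if isinstance(to_replace, dict):
            value = None  # ignored for the dict form anyway
        else:
            raise TypeError("value argument is required when "
                            "to_replace is not a dictionary")
    return to_replace, value, subset
```

Unlike `value=None`, this keeps the named parameters (and their order) visible in the signature while still detecting a missing argument.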


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166302987
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
 return DataFrame(self._jdf.na().fill(value, 
self._jseq(subset)), self.sql_ctx)
 
 @since(1.4)
-def replace(self, to_replace, value=None, subset=None):
+def replace(self, to_replace, *args, **kwargs):
--- End diff --

It's just as is:

```
def replace(self, to_replace, *args, **kwargs)
```

but this is better than `replace(self, to_replace, value={}, subset=None)` 
IMHO.


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166302599
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
 return DataFrame(self._jdf.na().fill(value, 
self._jseq(subset)), self.sql_ctx)
 
 @since(1.4)
-def replace(self, to_replace, value=None, subset=None):
+def replace(self, to_replace, *args, **kwargs):
--- End diff --

what's the docstring for `def replace(self, to_replace, *args, **kwargs)`?


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166301172
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
 return DataFrame(self._jdf.na().fill(value, 
self._jseq(subset)), self.sql_ctx)
 
 @since(1.4)
-def replace(self, to_replace, value=None, subset=None):
+def replace(self, to_replace, *args, **kwargs):
--- End diff --

So, to cut it short, yea, if a less pretty doc is fine, I can try. That would 
reduce the change a lot.


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166300631
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
 return DataFrame(self._jdf.na().fill(value, 
self._jseq(subset)), self.sql_ctx)
 
 @since(1.4)
-def replace(self, to_replace, value=None, subset=None):
+def replace(self, to_replace, *args, **kwargs):
--- End diff --

Yea, I think that summarises the issue 

> Can we use an invalid value as the default value for value? Then we can 
throw exception if the value is not set by user.

Yea, we could define a class / instance to indicate no value like NumPy does -
 https://github.com/numpy/numpy/blob/master/numpy/_globals.py#L76 . I was 
thinking of a similar approach too, but it is kind of new to Spark and this is 
a single case so far.

To get to the point, yea, we could maybe use an invalid value and unset it if 
`to_replace` is a dictionary. For example, I could assign `{}`. But then the 
problem is the docstring produced by pydoc and the API documentation. It will 
show something like:

```
Help on method replace in module pyspark.sql.dataframe:

replace(self, to_replace, value={}, subset=None) method of 
pyspark.sql.dataframe.DataFrame instance
Returns a new :class:`DataFrame` replacing a value with another value.
...
```

This is pretty confusing. To my knowledge, we can't really override this 
signature - I tried a few times before and failed, if I remember correctly.

Maybe this is good enough, but I didn't want to start with that approach, 
because the issue @rxin raised sounds like it is, strictly speaking, about 
having a default value at all.

To be honest, it seems pandas's `replace` also has `None` as the default value -
 
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html#pandas.DataFrame.replace.
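The docstring problem described above can be reproduced outside Spark with a toy function (an editorial example; `inspect.signature` renders default values the same way pydoc does):

```python
import inspect


def replace(to_replace, value={}, subset=None):
    """Returns a new DataFrame replacing a value with another value."""


# The '{}' placeholder leaks straight into the rendered signature,
# which is what pydoc and the generated API docs would display:
print(inspect.signature(replace))  # (to_replace, value={}, subset=None)
```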


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166290471
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
 return DataFrame(self._jdf.na().fill(value, 
self._jseq(subset)), self.sql_ctx)
 
 @since(1.4)
-def replace(self, to_replace, value=None, subset=None):
+def replace(self, to_replace, *args, **kwargs):
--- End diff --

I see the problem now. If `to_replace` is a dict, then `value` should be 
ignored and we should provide a default value. If `to_replace` is not a dict, 
then `value` is required and we should not provide a default value.

Can we use an invalid value as the default for `value`? Then we can throw an 
exception if `value` is not set by the user.


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166286067
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
 return DataFrame(self._jdf.na().fill(value, 
self._jseq(subset)), self.sql_ctx)
 
 @since(1.4)
-def replace(self, to_replace, value=None, subset=None):
+def replace(self, to_replace, *args, **kwargs):
--- End diff --

Yes, if `value` is explicitly given, I thought we would ignore it, as we have 
from the start.


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166278935
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
 return DataFrame(self._jdf.na().fill(value, 
self._jseq(subset)), self.sql_ctx)
 
 @since(1.4)
-def replace(self, to_replace, value=None, subset=None):
+def replace(self, to_replace, *args, **kwargs):
--- End diff --

What's the expectation? If `to_replace` is a dict, should `value` be ignored?


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166220816
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1545,8 +1545,8 @@ def replace(self, to_replace, value=None, 
subset=None):
 
 :param to_replace: bool, int, long, float, string, list or dict.
 Value to be replaced.
-If the value is a dict, then `value` is ignored and 
`to_replace` must be a
-mapping between a value and a replacement.
+If the value is a dict, then `value` is ignored or can be 
omitted, and `to_replace`
--- End diff --

now there is no parameter named `value`.


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166217503
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1587,6 +1597,51 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null|null|
 ++--++
 """
+
+# It deals with a problem when 'value' is set to None and 
'to_replace' is a dictionary.
+# Validate if arguments are missing or not.
+is_more_than_two = len(args) + len(kwargs) > 2
--- End diff --

I read this a few times and still feel that it is kind of verbose. But it seems 
there is no better way in Python to check whether an optional parameter was 
set.


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r166216324
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2175,7 +2175,7 @@ def test_replace(self):
 
 # replace with subset specified by a string of a column name w/ 
actual change
 row = self.spark.createDataFrame(
-[(u'Alice', 10, 80.1)], schema).replace(10, 20, 
subset='age').first()
+[(u'Alice', 10, 80.1)], schema).replace(10, 'age', 
value=20).first()
--- End diff --

Will this conflict with Python's argument-passing conventions?

Usually, arguments before a keyword argument are resolved by position. But now 
`'age'` is resolved to `subset`, which is the third parameter, behind `value`.

Since the function signature is changed, this may not be a big issue.
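The binding described here can be seen with a toy dispatcher (a hypothetical sketch of how a `*args`/`**kwargs` signature has to resolve arguments by hand; not the PR's actual implementation):

```python
def replace(to_replace, *args, **kwargs):
    # When 'value' arrives as a keyword, a lone extra positional
    # argument must be 'subset'; otherwise the first positional is
    # 'value' and the second (if any) is 'subset'.
    if "value" in kwargs:
        value = kwargs["value"]
        subset = args[0] if args else kwargs.get("subset")
    else:
        value = args[0] if args else None
        subset = args[1] if len(args) > 1 else kwargs.get("subset")
    return to_replace, value, subset
```

So in `replace(10, 'age', value=20)`, `'age'` binds to `subset` even though it occupies the position `value` used to have.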


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165951235
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2186,7 +2186,7 @@ def test_replace(self):
 # replace with subset specified with one column replaced, another 
column not in subset
 # stays unchanged.
 row = self.spark.createDataFrame(
-[(u'Alice', 10, 10.0)], schema).replace(10, 20, 
subset=['name', 'age']).first()
+[(u'Alice', 10, 10.0)], schema).replace(10, value=20, 
subset=['name', 'age']).first()
--- End diff --

I don't think it's necessary, but let me keep them since at least they test 
different combinations of valid cases.


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165950192
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1557,6 +1557,9 @@ def replace(self, to_replace, value=None, 
subset=None):
 For example, if `value` is a string, and subset contains a 
non-string column,
 then the non-string column is simply ignored.
 
+.. note:: `value` can only be omitted when `to_replace` is a 
dictionary. Otherwise,
+it is required.
--- End diff --

Sure.


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165883566
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2186,7 +2186,7 @@ def test_replace(self):
 # replace with subset specified with one column replaced, another 
column not in subset
 # stays unchanged.
 row = self.spark.createDataFrame(
-[(u'Alice', 10, 10.0)], schema).replace(10, 20, 
subset=['name', 'age']).first()
+[(u'Alice', 10, 10.0)], schema).replace(10, value=20, 
subset=['name', 'age']).first()
--- End diff --

Are the above two test changes necessary?


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165884365
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1587,6 +1600,52 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null|null|
 ++--++
 """
+
+# It deals with a problem when 'value' is set to None and 
'to_replace' is a dictionary.
+# Validate if arguments are missing or not.
+is_more_than_two = len(args) + len(kwargs) > 2
+if is_more_than_two:
+raise TypeError(
+"The number of arguments should not be more than 3; 
however, got "
+"%s arguments." % len([to_replace] + list(args) + 
list(kwargs.values())))
+
+is_unexpected_kwargs = \
+len(kwargs) == 2 and ("value" not in kwargs or "subset" not in 
kwargs)
--- End diff --

```python
df.na.replace({'Alice': 'Bob'}, foo = 'bar').show()
```

Seems this case can't be detected?






---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165883051
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1557,6 +1557,9 @@ def replace(self, to_replace, value=None, 
subset=None):
 For example, if `value` is a string, and subset contains a 
non-string column,
 then the non-string column is simply ignored.
 
+.. note:: `value` can only be omitted when `to_replace` is a 
dictionary. Otherwise,
+it is required.
--- End diff --

Shall we just describe this in `value`'s param doc?


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165834947
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1587,6 +1600,51 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null|null|
 ++--++
 """
+
+# It deals with a problem when 'value' is set to None and 
'to_replace' is a dictionary.
+is_more_than_two = len(args) + len(kwargs) > 2
+if is_more_than_two:
+raise TypeError(
+"Arguments are expected to be to_replace, value and 
subset; however, got "
+"[%s]" % ", ".join(map(str, [to_replace] + list(args) + 
list(kwargs.values()))))
+
+is_unexpected_kwargs = \
+len(kwargs) == 2 and ("value" not in kwargs or "subset" not in 
kwargs)
+if is_unexpected_kwargs:
+raise TypeError(
+"Arguments are expected to be to_replace, value and 
subset; however, got "
+"unexpected keyword arguments: [%s]" % ", ".join(
+filter(lambda a: a not in ["value", "subset"], 
kwargs.keys())))
+
+if "value" in kwargs and "subset" in kwargs:
+# replace(..., value=..., subset=...)
+# replace(..., subset=..., value=...)
+value = kwargs["value"]
+subset = kwargs["subset"]
+elif "value" in kwargs:
+# replace(..., ..., value=...)
+value = kwargs["value"]
+subset = args[0] if len(args) == 1 else None
+elif "subset" in kwargs:
+# replace(..., ..., subset=...)
+if len(args) == 1:
+value = args[0]
+elif isinstance(to_replace, dict):
+value = None  # When to_replace is a dictionary, value can 
be omitted.
+else:
+raise TypeError("value is required when to_replace is not 
a dictionary.")
--- End diff --

Hm, I think that check is still valid. The newly added logic here focuses on 
checking for missing arguments, whereas the logic below focuses on checking 
whether argument values are valid.

Will try to add an explicit test for the 
https://github.com/apache/spark/pull/20499#discussion_r165828448 case with a 
few comment changes.

For https://github.com/apache/spark/pull/20499#discussion_r165828519, I just 
tried it. It seems we should keep that `None` to support:

```python
>>> df.na.replace('Alice', None).show()
```

```
++--++
| age|height|name|
++--++
|  10|80|null|
...
```

```
...
ValueError: If to_replace is not a dict, value should be a bool, float, 
int, long, string, list, tuple or None. Got 
```


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165828519
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1587,6 +1600,51 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null|null|
 ++--++
 """
+
+# It deals with a problem when 'value' is set to None and 
'to_replace' is a dictionary.
+is_more_than_two = len(args) + len(kwargs) > 2
+if is_more_than_two:
+raise TypeError(
+"Arguments are expected to be to_replace, value and 
subset; however, got "
+"[%s]" % ", ".join(map(str, [to_replace] + list(args) + 
list(kwargs.values()))))
+
+is_unexpected_kwargs = \
+len(kwargs) == 2 and ("value" not in kwargs or "subset" not in 
kwargs)
+if is_unexpected_kwargs:
+raise TypeError(
+"Arguments are expected to be to_replace, value and 
subset; however, got "
+"unexpected keyword arguments: [%s]" % ", ".join(
+filter(lambda a: a not in ["value", "subset"], 
kwargs.keys())))
+
+if "value" in kwargs and "subset" in kwargs:
+# replace(..., value=..., subset=...)
+# replace(..., subset=..., value=...)
+value = kwargs["value"]
+subset = kwargs["subset"]
+elif "value" in kwargs:
+# replace(..., ..., value=...)
+value = kwargs["value"]
+subset = args[0] if len(args) == 1 else None
+elif "subset" in kwargs:
+# replace(..., ..., subset=...)
+if len(args) == 1:
+value = args[0]
+elif isinstance(to_replace, dict):
+value = None  # When to_replace is a dictionary, value can 
be omitted.
+else:
+raise TypeError("value is required when to_replace is not 
a dictionary.")
--- End diff --

Btw, can't we just remove `value is not None` above, so that `None` is 
disallowed when `to_replace` is not a dict?


---




[GitHub] spark pull request #20499: [SPARK-23328][PYTHON] Disallow default value None...

2018-02-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20499#discussion_r165828448
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1587,6 +1600,51 @@ def replace(self, to_replace, value=None, 
subset=None):
 |null|  null|null|
 ++--++
 """
+
+# It deals with a problem when 'value' is set to None and 
'to_replace' is a dictionary.
+is_more_than_two = len(args) + len(kwargs) > 2
+if is_more_than_two:
+raise TypeError(
+"Arguments are expected to be to_replace, value and 
subset; however, got "
+"[%s]" % ", ".join(map(str, [to_replace] + list(args) + 
list(kwargs.values()))))
+
+is_unexpected_kwargs = \
+len(kwargs) == 2 and ("value" not in kwargs or "subset" not in 
kwargs)
+if is_unexpected_kwargs:
+raise TypeError(
+"Arguments are expected to be to_replace, value and 
subset; however, got "
+"unexpected keyword arguments: [%s]" % ", ".join(
+filter(lambda a: a not in ["value", "subset"], 
kwargs.keys())))
+
+if "value" in kwargs and "subset" in kwargs:
+# replace(..., value=..., subset=...)
+# replace(..., subset=..., value=...)
+value = kwargs["value"]
+subset = kwargs["subset"]
+elif "value" in kwargs:
+# replace(..., ..., value=...)
+value = kwargs["value"]
+subset = args[0] if len(args) == 1 else None
+elif "subset" in kwargs:
+# replace(..., ..., subset=...)
+if len(args) == 1:
+value = args[0]
+elif isinstance(to_replace, dict):
+value = None  # When to_replace is a dictionary, value can 
be omitted.
+else:
+raise TypeError("value is required when to_replace is not 
a dictionary.")
--- End diff --

There are some old checks below, like:
```python
if not isinstance(value, valid_types) and value is not None \
and not isinstance(to_replace, dict):
raise ValueError("If to_replace is not a dict, value should be "
 "a bool, float, int, long, string, list, tuple 
or None. "
 "Got {0}".format(type(value)))
```

Should we clean it up too?


---
