[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-07-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17227





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r125204915
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,162 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_exception_msg(self):
+name = "test_name"
+try:
+_verify_type(None, StringType(), nullable=False, name=name)
+self.fail('Expected _verify_type() to throw so test can check exception message')
+except Exception as e:
+self.assertTrue(str(e).startswith(name))
+
+def test_verify_type_ok_nullable(self):
+obj = None
+for data_type in [IntegerType(), FloatType(), StringType(), StructType([])]:
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+try:
+_verify_type(obj, data_type, nullable=True)
+except Exception as e:
+traceback.print_exc()
+self.fail(msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
--- End diff --

Could we make the first character of this lower-cased? (or maybe just name it simply `schema`?)





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-07-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r125205112
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }
 
 
-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
--- End diff --

Could we maybe then use `None` and not print it?





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-26 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r124135474
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }
 
 
-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
--- End diff --

Set `name=value` in the call at session.py line 516.  

It will still print `obj` if the schema is a StructType: `TypeError: obj.a: 
MyStructType can not accept object 'a' in type `.  Would you like 
to change that too?

Right now changing the default name to None would make the error message 
worse: `TypeError: None: IntegerType can not accept object 'a' in type `.  

The best way to make the error message pretty is probably (see the sketch below):
- Set the default name to None
- If name==None, don't prepend the `%s: ` string to the error messages

That would make your example: `TypeError: IntegerType can not accept object 'a' in type `.

IMO `obj` is not as pretty but reasonable since it's so simple.  Let me 
know what you prefer.  My only goal is that next time I get a schema failure it 
tells me what field to look at :)





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-26 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r124109294
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,157 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
--- End diff --

Sure, either works for me.  Changed.





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123875271
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }
 
 
-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
--- End diff --

I guess this is the only place where we print "obj", maybe? If so, let's set `name=None`.





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123875195
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }
 
 
-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
--- End diff --

Let's fix this case.

```python
>>> from pyspark.sql.types import *
>>> spark.createDataFrame(["a"], StringType()).printSchema()
```

```
root
 |-- value: string (nullable = true)
```
```python
>>> from pyspark.sql.types import *
>>> spark.createDataFrame(["a"], IntegerType()).printSchema()
```
```
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File ".../spark/python/pyspark/sql/session.py", line 387, in _createFromLocal
    data = list(data)
  File ".../spark/python/pyspark/sql/session.py", line 516, in prepare
    verify_func(obj, dataType)
  File ".../spark/python/pyspark/sql/types.py", line 1326, in _verify_type
    % (name, dataType, obj, type(obj)))
TypeError: obj: IntegerType can not accept object 'a' in type 
```

It sounds "obj" should be "value". It looks we should specify the name 
around 
https://github.com/dgingrich/spark/blob/topic-spark-19507-verify-types/python/pyspark/sql/session.py#L516.
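
A minimal sketch of that suggestion (`verify_func` is the local from the traceback above; `"value"` matches the column name `createDataFrame` assigns, as the printed schema shows):

```python
# In session.py's prepare() (sketch): pass the assigned column name through
verify_func(obj, dataType, name="value")
```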





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123875088
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }
 
 
-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
--- End diff --

I meant this case:

```python
>>> from pyspark.sql.types import *
>>> spark.createDataFrame(["a"], StringType()).printSchema()
```

```
root
 |-- value: string (nullable = true)
```
```python
>>> from pyspark.sql.types import *
>>> spark.createDataFrame(["a"], IntegerType()).printSchema()
```
```
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File ".../spark/python/pyspark/sql/session.py", line 387, in _createFromLocal
    data = list(data)
  File ".../spark/python/pyspark/sql/session.py", line 516, in prepare
    verify_func(obj, dataType)
  File ".../spark/python/pyspark/sql/types.py", line 1326, in _verify_type
    % (name, dataType, obj, type(obj)))
TypeError: obj: IntegerType can not accept object 'a' in type 
```

It sounds "obj" should be "value". It looks we should specify the name 
around 
https://github.com/dgingrich/spark/blob/topic-spark-19507-verify-types/python/pyspark/sql/session.py#L516.





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123874870
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,157 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
--- End diff --

Let's do this like ...

```python
types = [IntegerType(), FloatType(), StringType(), StructType([])]
for ...
```

if you don't mind. I think factoring the repeated value out of the loop is slightly better.
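
Spelled out against the test body above, that would look something like (a sketch, inside the test method):

```python
types = [IntegerType(), FloatType(), StringType(), StructType([])]
for data_type in types:
    # None is the only value being verified, so only the type varies
    _verify_type(None, data_type, nullable=True)
```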





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123874817
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -30,6 +30,19 @@
 import functools
 import time
 import datetime
+import traceback
+
+if sys.version_info[:2] <= (2, 6):
--- End diff --

Yea, let's leave it then. Not a big deal.





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-22 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123615521
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1300,70 +1300,80 @@ def _verify_type(obj, dataType, nullable=True):
 if nullable:
 return
 else:
-raise ValueError("This field is not nullable, but got None")
+raise ValueError("%s: This field is not nullable, but got 
None" % name)
--- End diff --

No, I never check the actual exception message.  I normally don't check the 
contents of exception messages since they shouldn't be used programmatically 
(the tests are mostly to exercise all code paths to make sure I didn't break 
something).

But it makes sense to test that the prefix is set since that's the main point of the change. Added a test looking at the exception message prefix, which should be robust.
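
The check added is essentially the `test_verify_type_exception_msg` quoted in the later revision at the top of this thread:

```python
def test_verify_type_exception_msg(self):
    name = "test_name"
    try:
        _verify_type(None, StringType(), nullable=False, name=name)
        self.fail('Expected _verify_type() to throw so test can check exception message')
    except Exception as e:
        # The field name should be the prefix of the error message
        self.assertTrue(str(e).startswith(name))
```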





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-22 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123611855
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }
 
 
-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
--- End diff --

This will print "obj" when called from `session.createDataFrame` 
(https://github.com/dgingrich/spark/blob/topic-spark-19507-verify-types/python/pyspark/sql/session.py#L408).
  It'd be easy to set the name where it's called but it wasn't clear what to 
set it to.   The input can be either an RDD, list, or pandas.DataFrame.





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-22 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123610140
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,157 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+try:
+_verify_type(obj, data_type, nullable=True)
+except Exception as e:
+traceback.print_exc()
+self.fail(msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
+(["1", None], ArrayType(StringType(), containsNull=False), 
ValueError),
+([1, 2], ArrayType(IntegerType()), None),
+([1, "2"], ArrayType(IntegerType()), TypeError),
+((1, 2), ArrayType(IntegerType()), None),
+(array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+# Map
+({}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+({"a": 

[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-22 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123610062
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,157 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
--- End diff --

Meaning remove the None from the tuples and just loop over the Types?  I 
like it a little better as is since the tuples are basically `_verify_type`'s 
args but am fine with either.  Let me know which you prefer and I can change 
the code.





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-22 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123609093
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -30,6 +30,19 @@
 import functools
 import time
 import datetime
+import traceback
+
+if sys.version_info[:2] <= (2, 6):
--- End diff --

Looks like most of the other tests still have the `<= (2, 6)` check (see 
python/pyspark/ml/tests.py) so leaving in place.





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123134376
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1300,70 +1300,80 @@ def _verify_type(obj, dataType, nullable=True):
 if nullable:
 return
 else:
-raise ValueError("This field is not nullable, but got None")
+raise ValueError("%s: This field is not nullable, but got 
None" % name)
--- End diff --

Probably, I missed something. However, is there any test case that actually 
checks this message change?





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123133489
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -30,6 +30,19 @@
 import functools
 import time
 import datetime
+import traceback
+
+if sys.version_info[:2] <= (2, 6):
--- End diff --

Not a big deal but I guess we dropped 2.6 support.





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123134659
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,157 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+try:
+_verify_type(obj, data_type, nullable=True)
+except Exception as e:
+traceback.print_exc()
+self.fail(msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
+(["1", None], ArrayType(StringType(), containsNull=False), 
ValueError),
+([1, 2], ArrayType(IntegerType()), None),
+([1, "2"], ArrayType(IntegerType()), TypeError),
+((1, 2), ArrayType(IntegerType()), None),
+(array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+# Map
+({}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+({"a": 

[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123133739
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,157 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
--- End diff --

Not a big deal either. Could we just take this out into a separate variable?





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123134991
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }
 
 
-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
--- End diff --

Just a question. @dgingrich Do you maybe know if there is any chance that
"obj" is printed instead? It is rather a nitpick but I would think it is odd
if it prints "obj".





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123133997
  
--- Diff: python/pyspark/rdd.py ---
@@ -627,7 +627,6 @@ def sortPartition(iterator):
def sortByKey(self, ascending=True, numPartitions=None, keyfunc=lambda x: x):
 """
 Sorts this RDD, which is assumed to consist of (key, value) pairs.
-# noqa
--- End diff --

(I have no idea why this was added in the first place ...)





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123104696
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
+([1, 2], ArrayType(IntegerType()), None),
+([1, "2"], ArrayType(IntegerType()), TypeError),
+((1, 2), ArrayType(IntegerType()), None),
+(array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+# Map
+({}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+({"a": None}, MapType(StringType(), IntegerType(), 
valueContainsNull=True), None),
+
+# Struct
+({"s": "a", "i": 1}, MyStructType, None),
+({"s": "a", 

[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123100185
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
+([1, 2], ArrayType(IntegerType()), None),
+([1, "2"], ArrayType(IntegerType()), TypeError),
+((1, 2), ArrayType(IntegerType()), None),
+(array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+# Map
+({}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+({"a": None}, MapType(StringType(), IntegerType(), 
valueContainsNull=True), None),
+
+# Struct
+({"s": "a", "i": 1}, MyStructType, None),
+({"s": "a", 

[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123099962
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
+([1, 2], ArrayType(IntegerType()), None),
+([1, "2"], ArrayType(IntegerType()), TypeError),
+((1, 2), ArrayType(IntegerType()), None),
+(array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+# Map
+({}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+({"a": None}, MapType(StringType(), IntegerType(), 
valueContainsNull=True), None),
--- End diff --

Added



[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123099761
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
--- End diff --

Added





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-20 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r123099453
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
--- End diff --

Added





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r122282700
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
+([1, 2], ArrayType(IntegerType()), None),
+([1, "2"], ArrayType(IntegerType()), TypeError),
+((1, 2), ArrayType(IntegerType()), None),
+(array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+# Map
+({}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+({"a": None}, MapType(StringType(), IntegerType(), 
valueContainsNull=True), None),
--- End diff --

I'd also like you to add a `valueContainsNull=False` case.
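
For example, an entry along these lines in the `spec` list (a sketch; the exact case is up to you):

```python
# None value under valueContainsNull=False should raise ValueError
({"a": None}, MapType(StringType(), IntegerType(), valueContainsNull=False), ValueError),
```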



[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r122283340
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
+([1, 2], ArrayType(IntegerType()), None),
+([1, "2"], ArrayType(IntegerType()), TypeError),
+((1, 2), ArrayType(IntegerType()), None),
+(array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+# Map
+({}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+({"a": None}, MapType(StringType(), IntegerType(), 
valueContainsNull=True), None),
+
+# Struct
+({"s": "a", "i": 1}, MyStructType, None),
+({"s": "a", 

[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r122282508
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
--- End diff --

I'd like you to add a `containsNull=False` case too, one that contains `None` in the list, to verify that it raises `ValueError` correctly.
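
Something like this entry in the `spec` list (this is the case that appears in the later revision of the diff elsewhere in this thread):

```python
# None element under containsNull=False should raise ValueError
(["1", None], ArrayType(StringType(), containsNull=False), ValueError),
```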





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r122273265
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
+
+def test_verify_type_not_nullable(self):
+import array
+import datetime
+import decimal
+
+MyStructType = StructType([
+StructField('s', StringType(), nullable=False),
+StructField('i', IntegerType(), nullable=True)])
+
+class MyObj:
+def __init__(self, **ka):
+for k, v in ka.items():
+setattr(self, k, v)
+
+# obj, data_type, exception (None for success or Exception subclass for error)
+spec = [
+# Strings (match anything but None)
+("", StringType(), None),
+(u"", StringType(), None),
+(1, StringType(), None),
+(1.0, StringType(), None),
+([], StringType(), None),
+({}, StringType(), None),
+(None, StringType(), ValueError),   # Only None test
+
+# UDT
+(ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+(ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+# Boolean
+(True, BooleanType(), None),
+(1, BooleanType(), TypeError),
+("True", BooleanType(), TypeError),
+([1], BooleanType(), TypeError),
+
+# Bytes
+(-(2**7) - 1, ByteType(), ValueError),
+(-(2**7), ByteType(), None),
+(2**7 - 1, ByteType(), None),
+(2**7, ByteType(), ValueError),
+("1", ByteType(), TypeError),
+(1.0, ByteType(), TypeError),
+
+# Shorts
+(-(2**15) - 1, ShortType(), ValueError),
+(-(2**15), ShortType(), None),
+(2**15 - 1, ShortType(), None),
+(2**15, ShortType(), ValueError),
+
+# Integer
+(-(2**31) - 1, IntegerType(), ValueError),
+(-(2**31), IntegerType(), None),
+(2**31 - 1, IntegerType(), None),
+(2**31, IntegerType(), ValueError),
+
+# Long
+(2**64, LongType(), None),
+
+# Float & Double
+(1.0, FloatType(), None),
+(1, FloatType(), TypeError),
+(1.0, DoubleType(), None),
+(1, DoubleType(), TypeError),
+
+# Decimal
+(decimal.Decimal("1.0"), DecimalType(), None),
+(1.0, DecimalType(), TypeError),
+(1, DecimalType(), TypeError),
+("1.0", DecimalType(), TypeError),
+
+# Binary
+(bytearray([1, 2]), BinaryType(), None),
+(1, BinaryType(), TypeError),
+
+# Date/Time
+(datetime.date(2000, 1, 2), DateType(), None),
+(datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+("2000-01-02", DateType(), TypeError),
+(datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+(946811040, TimestampType(), TypeError),
+
+# Array
+([], ArrayType(IntegerType()), None),
+(["1", None], ArrayType(StringType(), containsNull=True), 
None),
+([1, 2], ArrayType(IntegerType()), None),
+([1, "2"], ArrayType(IntegerType()), TypeError),
+((1, 2), ArrayType(IntegerType()), None),
+(array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+# Map
+({}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(StringType(), IntegerType()), None),
+({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+({"a": None}, MapType(StringType(), IntegerType(), 
valueContainsNull=True), None),
+
+# Struct
+({"s": "a", "i": 1}, MyStructType, None),
+({"s": "a", 

[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-06-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17227#discussion_r122287524
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2367,6 +2380,151 @@ def range_frame_match():
 
 importlib.reload(window)
 
+
+class TypesTest(unittest.TestCase):
+
+def test_verify_type_ok_nullable(self):
+for obj, data_type in [
+(None, IntegerType()),
+(None, FloatType()),
+(None, StringType()),
+(None, StructType([]))]:
+_verify_type(obj, data_type, nullable=True)
+msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+self.assertTrue(True, msg)
--- End diff --

I think we should surround `_verify_type(obj, data_type, nullable=True)` with a try block and check whether it raises an exception, the same as we do in the `test_verify_type_not_nullable` test.
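
That is, something like the following (a sketch of the shape the test later took, as quoted at the top of this thread):

```python
try:
    _verify_type(obj, data_type, nullable=True)
except Exception:
    # Print the unexpected exception, then fail with the offending arguments
    traceback.print_exc()
    self.fail("_verify_type(%s, %s, nullable=True)" % (obj, data_type))
```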





[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...

2017-03-09 Thread dgingrich
GitHub user dgingrich opened a pull request:

https://github.com/apache/spark/pull/17227

[SPARK-19507][PySpark][SQL] Show field name in _verify_type error

## What changes were proposed in this pull request?

Add better error messages to _verify_type to track down which columns are 
not compliant with the schema.
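
For illustration, the intended difference in a schema-mismatch error (a sketch based on examples elsewhere in this thread; the exact field path depends on the schema):

```
Before: TypeError: IntegerType can not accept object 'a' in type ...
After:  TypeError: obj.a: IntegerType can not accept object 'a' in type ...
```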

## How was this patch tested?

Unit tests (incomplete), doctest, hand inspection in REPL.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dgingrich/spark topic-spark-19507-verify-types

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17227.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17227


commit ad5e5e5e5ed8396efca4c61eb0219fcd5a5e2caf
Author: David Gingrich 
Date:   2017-02-28T08:05:00Z

Remove "# noqa" comment from docstring

commit 5f72a547a948b5c5a787aace52df04bc8503888b
Author: David Gingrich 
Date:   2017-02-28T08:09:59Z

WIP: Add name parameter and better debugging to _verify_types

* Add name parameter to _verify_types
* Include name parameter in debug messages
* Build name message for nested structs, arrays, and maps
* Add detailed tests to flesh out spec for _verify_types (WIP)



