[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17227 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r125204915

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -2367,6 +2380,162 @@ def range_frame_match():
         importlib.reload(window)
+
+
+class TypesTest(unittest.TestCase):
+
+    def test_verify_type_exception_msg(self):
+        name = "test_name"
+        try:
+            _verify_type(None, StringType(), nullable=False, name=name)
+            self.fail('Expected _verify_type() to throw so test can check exception message')
+        except Exception as e:
+            self.assertTrue(str(e).startswith(name))
+
+    def test_verify_type_ok_nullable(self):
+        obj = None
+        for data_type in [IntegerType(), FloatType(), StringType(), StructType([])]:
+            msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+            try:
+                _verify_type(obj, data_type, nullable=True)
+            except Exception as e:
+                traceback.print_exc()
+                self.fail(msg)
+
+    def test_verify_type_not_nullable(self):
+        import array
+        import datetime
+        import decimal
+
+        MyStructType = StructType([
```
--- End diff --

Could we make the first character here lower-cased? (or maybe just simply `schema`?)
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r125205112

--- Diff: python/pyspark/sql/types.py ---
```
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }


-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
```
--- End diff --

Could we maybe then use `None` and not print it?
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r124135474

--- Diff: python/pyspark/sql/types.py ---
```
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }


-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
```
--- End diff --

Set `name=value` in the call at session.py line 516. It will still print `obj` if the schema is a StructType: `TypeError: obj.a: MyStructType can not accept object 'a' in type `. Would you like to change that too?

Right now, changing the default name to `None` would make the error message worse: `TypeError: None: IntegerType can not accept object 'a' in type `.

The best way to make the error message pretty is probably:

- Set the default name to `None`
- If `name` is `None`, don't prepend the `%s: ` string to the error messages

That would make your example: `TypeError: IntegerType can not accept object 'a' in type `.

IMO `obj` is not as pretty but reasonable since it's so simple. Let me know what you prefer. My only goal is that the next time I get a schema failure, it tells me what field to look at :)
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r124109294

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -2367,6 +2380,157 @@ def range_frame_match():
         importlib.reload(window)
+
+
+class TypesTest(unittest.TestCase):
+
+    def test_verify_type_ok_nullable(self):
+        for obj, data_type in [
+                (None, IntegerType()),
+                (None, FloatType()),
+                (None, StringType()),
+                (None, StructType([]))]:
```
--- End diff --

Sure, either works for me. Changed.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123875271

--- Diff: python/pyspark/sql/types.py ---
```
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }


-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
```
--- End diff --

I guess this is the only place where we print "obj"? If so, let's set `name=None`.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123875195

--- Diff: python/pyspark/sql/types.py ---
```
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }


-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
```
--- End diff --

Let's fix this case.

```python
>>> from pyspark.sql.types import *
>>> spark.createDataFrame(["a"], StringType()).printSchema()
```
```
root
 |-- value: string (nullable = true)
```
```python
>>> from pyspark.sql.types import *
>>> spark.createDataFrame(["a"], IntegerType()).printSchema()
```
```
Traceback (most recent call last):
  File "", line 1, in
  File ".../spark/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File ".../spark/python/pyspark/sql/session.py", line 387, in _createFromLocal
    data = list(data)
  File ".../spark/python/pyspark/sql/session.py", line 516, in prepare
    verify_func(obj, dataType)
  File ".../spark/python/pyspark/sql/types.py", line 1326, in _verify_type
    % (name, dataType, obj, type(obj)))
TypeError: obj: IntegerType can not accept object 'a' in type 
```

It sounds like "obj" should be "value". It looks like we should specify the name around https://github.com/dgingrich/spark/blob/topic-spark-19507-verify-types/python/pyspark/sql/session.py#L516.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123875088

--- Diff: python/pyspark/sql/types.py ---
```
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }


-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
```
--- End diff --

I meant this case:

```python
>>> from pyspark.sql.types import *
>>> spark.createDataFrame(["a"], StringType()).printSchema()
```
```
root
 |-- value: string (nullable = true)
```
```python
>>> from pyspark.sql.types import *
>>> spark.createDataFrame(["a"], IntegerType()).printSchema()
```
```
Traceback (most recent call last):
  File "", line 1, in
  File ".../spark/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File ".../spark/python/pyspark/sql/session.py", line 387, in _createFromLocal
    data = list(data)
  File ".../spark/python/pyspark/sql/session.py", line 516, in prepare
    verify_func(obj, dataType)
  File ".../spark/python/pyspark/sql/types.py", line 1326, in _verify_type
    % (name, dataType, obj, type(obj)))
TypeError: obj: IntegerType can not accept object 'a' in type 
```

It sounds like "obj" should be "value". It looks like we should specify the name around https://github.com/dgingrich/spark/blob/topic-spark-19507-verify-types/python/pyspark/sql/session.py#L516.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123874870

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -2367,6 +2380,157 @@ def range_frame_match():
         importlib.reload(window)
+
+
+class TypesTest(unittest.TestCase):
+
+    def test_verify_type_ok_nullable(self):
+        for obj, data_type in [
+                (None, IntegerType()),
+                (None, FloatType()),
+                (None, StringType()),
+                (None, StructType([]))]:
```
--- End diff --

Let's do this like ...

```python
types = [IntegerType(), FloatType(), StringType(), StructType([])]
for ...
```

if you don't mind. I think factoring the repeated value out of the loop is slightly better.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123874817

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -30,6 +30,19 @@
 import functools
 import time
 import datetime
+import traceback
+
+if sys.version_info[:2] <= (2, 6):
```
--- End diff --

Yea, let's leave it then. Not a big deal.
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123615521

--- Diff: python/pyspark/sql/types.py ---
```
@@ -1300,70 +1300,80 @@ def _verify_type(obj, dataType, nullable=True):
         if nullable:
             return
         else:
-            raise ValueError("This field is not nullable, but got None")
+            raise ValueError("%s: This field is not nullable, but got None" % name)
```
--- End diff --

No, I never check the actual exception message. I normally don't check the contents of exception messages, since they shouldn't be used programmatically (the tests are mostly there to exercise all code paths and make sure I didn't break something). But it makes sense to test that the prefix is set, since that's the main point of the change. Added a test that looks at the exception message prefix, which should be robust.
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123611855

--- Diff: python/pyspark/sql/types.py ---
```
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }


-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
```
--- End diff --

This will print "obj" when called from `session.createDataFrame` (https://github.com/dgingrich/spark/blob/topic-spark-19507-verify-types/python/pyspark/sql/session.py#L408). It'd be easy to set the name where it's called, but it wasn't clear what to set it to. The input can be either an RDD, list, or pandas.DataFrame.
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123610140

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -2367,6 +2380,157 @@ def range_frame_match():
         importlib.reload(window)
+
+
+class TypesTest(unittest.TestCase):
+
+    def test_verify_type_ok_nullable(self):
+        for obj, data_type in [
+                (None, IntegerType()),
+                (None, FloatType()),
+                (None, StringType()),
+                (None, StructType([]))]:
+            msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+            try:
+                _verify_type(obj, data_type, nullable=True)
+            except Exception as e:
+                traceback.print_exc()
+                self.fail(msg)
+
+    def test_verify_type_not_nullable(self):
+        import array
+        import datetime
+        import decimal
+
+        MyStructType = StructType([
+            StructField('s', StringType(), nullable=False),
+            StructField('i', IntegerType(), nullable=True)])
+
+        class MyObj:
+            def __init__(self, **ka):
+                for k, v in ka.items():
+                    setattr(self, k, v)
+
+        # obj, data_type, exception (None for success or Exception subclass for error)
+        spec = [
+            # Strings (match anything but None)
+            ("", StringType(), None),
+            (u"", StringType(), None),
+            (1, StringType(), None),
+            (1.0, StringType(), None),
+            ([], StringType(), None),
+            ({}, StringType(), None),
+            (None, StringType(), ValueError),  # Only None test
+
+            # UDT
+            (ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+            (ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+            # Boolean
+            (True, BooleanType(), None),
+            (1, BooleanType(), TypeError),
+            ("True", BooleanType(), TypeError),
+            ([1], BooleanType(), TypeError),
+
+            # Bytes
+            (-(2**7) - 1, ByteType(), ValueError),
+            (-(2**7), ByteType(), None),
+            (2**7 - 1, ByteType(), None),
+            (2**7, ByteType(), ValueError),
+            ("1", ByteType(), TypeError),
+            (1.0, ByteType(), TypeError),
+
+            # Shorts
+            (-(2**15) - 1, ShortType(), ValueError),
+            (-(2**15), ShortType(), None),
+            (2**15 - 1, ShortType(), None),
+            (2**15, ShortType(), ValueError),
+
+            # Integer
+            (-(2**31) - 1, IntegerType(), ValueError),
+            (-(2**31), IntegerType(), None),
+            (2**31 - 1, IntegerType(), None),
+            (2**31, IntegerType(), ValueError),
+
+            # Long
+            (2**64, LongType(), None),
+
+            # Float & Double
+            (1.0, FloatType(), None),
+            (1, FloatType(), TypeError),
+            (1.0, DoubleType(), None),
+            (1, DoubleType(), TypeError),
+
+            # Decimal
+            (decimal.Decimal("1.0"), DecimalType(), None),
+            (1.0, DecimalType(), TypeError),
+            (1, DecimalType(), TypeError),
+            ("1.0", DecimalType(), TypeError),
+
+            # Binary
+            (bytearray([1, 2]), BinaryType(), None),
+            (1, BinaryType(), TypeError),
+
+            # Date/Time
+            (datetime.date(2000, 1, 2), DateType(), None),
+            (datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+            ("2000-01-02", DateType(), TypeError),
+            (datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+            (946811040, TimestampType(), TypeError),
+
+            # Array
+            ([], ArrayType(IntegerType()), None),
+            (["1", None], ArrayType(StringType(), containsNull=True), None),
+            (["1", None], ArrayType(StringType(), containsNull=False), ValueError),
+            ([1, 2], ArrayType(IntegerType()), None),
+            ([1, "2"], ArrayType(IntegerType()), TypeError),
+            ((1, 2), ArrayType(IntegerType()), None),
+            (array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+            # Map
+            ({}, MapType(StringType(), IntegerType()), None),
+            ({"a": 1}, MapType(StringType(), IntegerType()), None),
+            ({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+            ({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+            ({"a":
```
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123610062

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -2367,6 +2380,151 @@ def range_frame_match():
         importlib.reload(window)
+
+
+class TypesTest(unittest.TestCase):
+
+    def test_verify_type_ok_nullable(self):
+        for obj, data_type in [
+                (None, IntegerType()),
+                (None, FloatType()),
+                (None, StringType()),
+                (None, StructType([]))]:
```
--- End diff --

Meaning remove the None from the tuples and just loop over the types? I like it a little better as-is, since the tuples are basically `_verify_type`'s args, but am fine with either. Let me know which you prefer and I can change the code.
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123609093

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -30,6 +30,19 @@
 import functools
 import time
 import datetime
+import traceback
+
+if sys.version_info[:2] <= (2, 6):
```
--- End diff --

Looks like most of the other tests still have the `<= (2, 6)` check (see python/pyspark/ml/tests.py), so I'm leaving it in place.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123134376

--- Diff: python/pyspark/sql/types.py ---
```
@@ -1300,70 +1300,80 @@ def _verify_type(obj, dataType, nullable=True):
         if nullable:
             return
         else:
-            raise ValueError("This field is not nullable, but got None")
+            raise ValueError("%s: This field is not nullable, but got None" % name)
```
--- End diff --

Probably, I missed something. However, is there any test case that actually checks this message change?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123133489

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -30,6 +30,19 @@
 import functools
 import time
 import datetime
+import traceback
+
+if sys.version_info[:2] <= (2, 6):
```
--- End diff --

Not a big deal, but I guess we dropped 2.6 support.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123134659

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -2367,6 +2380,157 @@ def range_frame_match():
         importlib.reload(window)
+
+
+class TypesTest(unittest.TestCase):
+
+    def test_verify_type_ok_nullable(self):
+        for obj, data_type in [
+                (None, IntegerType()),
+                (None, FloatType()),
+                (None, StringType()),
+                (None, StructType([]))]:
+            msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+            try:
+                _verify_type(obj, data_type, nullable=True)
+            except Exception as e:
+                traceback.print_exc()
+                self.fail(msg)
+
+    def test_verify_type_not_nullable(self):
+        import array
+        import datetime
+        import decimal
+
+        MyStructType = StructType([
+            StructField('s', StringType(), nullable=False),
+            StructField('i', IntegerType(), nullable=True)])
+
+        class MyObj:
+            def __init__(self, **ka):
+                for k, v in ka.items():
+                    setattr(self, k, v)
+
+        # obj, data_type, exception (None for success or Exception subclass for error)
+        spec = [
+            # Strings (match anything but None)
+            ("", StringType(), None),
+            (u"", StringType(), None),
+            (1, StringType(), None),
+            (1.0, StringType(), None),
+            ([], StringType(), None),
+            ({}, StringType(), None),
+            (None, StringType(), ValueError),  # Only None test
+
+            # UDT
+            (ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+            (ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+            # Boolean
+            (True, BooleanType(), None),
+            (1, BooleanType(), TypeError),
+            ("True", BooleanType(), TypeError),
+            ([1], BooleanType(), TypeError),
+
+            # Bytes
+            (-(2**7) - 1, ByteType(), ValueError),
+            (-(2**7), ByteType(), None),
+            (2**7 - 1, ByteType(), None),
+            (2**7, ByteType(), ValueError),
+            ("1", ByteType(), TypeError),
+            (1.0, ByteType(), TypeError),
+
+            # Shorts
+            (-(2**15) - 1, ShortType(), ValueError),
+            (-(2**15), ShortType(), None),
+            (2**15 - 1, ShortType(), None),
+            (2**15, ShortType(), ValueError),
+
+            # Integer
+            (-(2**31) - 1, IntegerType(), ValueError),
+            (-(2**31), IntegerType(), None),
+            (2**31 - 1, IntegerType(), None),
+            (2**31, IntegerType(), ValueError),
+
+            # Long
+            (2**64, LongType(), None),
+
+            # Float & Double
+            (1.0, FloatType(), None),
+            (1, FloatType(), TypeError),
+            (1.0, DoubleType(), None),
+            (1, DoubleType(), TypeError),
+
+            # Decimal
+            (decimal.Decimal("1.0"), DecimalType(), None),
+            (1.0, DecimalType(), TypeError),
+            (1, DecimalType(), TypeError),
+            ("1.0", DecimalType(), TypeError),
+
+            # Binary
+            (bytearray([1, 2]), BinaryType(), None),
+            (1, BinaryType(), TypeError),
+
+            # Date/Time
+            (datetime.date(2000, 1, 2), DateType(), None),
+            (datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+            ("2000-01-02", DateType(), TypeError),
+            (datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+            (946811040, TimestampType(), TypeError),
+
+            # Array
+            ([], ArrayType(IntegerType()), None),
+            (["1", None], ArrayType(StringType(), containsNull=True), None),
+            (["1", None], ArrayType(StringType(), containsNull=False), ValueError),
+            ([1, 2], ArrayType(IntegerType()), None),
+            ([1, "2"], ArrayType(IntegerType()), TypeError),
+            ((1, 2), ArrayType(IntegerType()), None),
+            (array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+            # Map
+            ({}, MapType(StringType(), IntegerType()), None),
+            ({"a": 1}, MapType(StringType(), IntegerType()), None),
+            ({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+            ({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+            ({"a":
```
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123133739

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -2367,6 +2380,157 @@ def range_frame_match():
         importlib.reload(window)
+
+
+class TypesTest(unittest.TestCase):
+
+    def test_verify_type_ok_nullable(self):
+        for obj, data_type in [
+                (None, IntegerType()),
+                (None, FloatType()),
+                (None, StringType()),
+                (None, StructType([]))]:
```
--- End diff --

Not a big deal either. Could we just take this out into a separate variable?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123134991

--- Diff: python/pyspark/sql/types.py ---
```
@@ -1249,7 +1249,7 @@ def _infer_schema_type(obj, dataType):
 }


-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name="obj"):
```
--- End diff --

Just a question. @dgingrich Do you maybe know if there is any chance that "obj" is printed instead? It is rather a nitpick, but I would think it is odd if it prints "obj".
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123133997

--- Diff: python/pyspark/rdd.py ---
```
@@ -627,7 +627,6 @@ def sortPartition(iterator):
     def sortByKey(self, ascending=True, numPartitions=None, keyfunc=lambda x: x):
         """
         Sorts this RDD, which is assumed to consist of (key, value) pairs.
-        # noqa
```
--- End diff --

(I have no idea why this was added in the first place ...)
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123104696

--- Diff: python/pyspark/sql/tests.py ---
```
@@ -2367,6 +2380,151 @@ def range_frame_match():
         importlib.reload(window)
+
+
+class TypesTest(unittest.TestCase):
+
+    def test_verify_type_ok_nullable(self):
+        for obj, data_type in [
+                (None, IntegerType()),
+                (None, FloatType()),
+                (None, StringType()),
+                (None, StructType([]))]:
+            _verify_type(obj, data_type, nullable=True)
+            msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+            self.assertTrue(True, msg)
+
+    def test_verify_type_not_nullable(self):
+        import array
+        import datetime
+        import decimal
+
+        MyStructType = StructType([
+            StructField('s', StringType(), nullable=False),
+            StructField('i', IntegerType(), nullable=True)])
+
+        class MyObj:
+            def __init__(self, **ka):
+                for k, v in ka.items():
+                    setattr(self, k, v)
+
+        # obj, data_type, exception (None for success or Exception subclass for error)
+        spec = [
+            # Strings (match anything but None)
+            ("", StringType(), None),
+            (u"", StringType(), None),
+            (1, StringType(), None),
+            (1.0, StringType(), None),
+            ([], StringType(), None),
+            ({}, StringType(), None),
+            (None, StringType(), ValueError),  # Only None test
+
+            # UDT
+            (ExamplePoint(1.0, 2.0), ExamplePointUDT(), None),
+            (ExamplePoint(1.0, 2.0), PythonOnlyUDT(), ValueError),
+
+            # Boolean
+            (True, BooleanType(), None),
+            (1, BooleanType(), TypeError),
+            ("True", BooleanType(), TypeError),
+            ([1], BooleanType(), TypeError),
+
+            # Bytes
+            (-(2**7) - 1, ByteType(), ValueError),
+            (-(2**7), ByteType(), None),
+            (2**7 - 1, ByteType(), None),
+            (2**7, ByteType(), ValueError),
+            ("1", ByteType(), TypeError),
+            (1.0, ByteType(), TypeError),
+
+            # Shorts
+            (-(2**15) - 1, ShortType(), ValueError),
+            (-(2**15), ShortType(), None),
+            (2**15 - 1, ShortType(), None),
+            (2**15, ShortType(), ValueError),
+
+            # Integer
+            (-(2**31) - 1, IntegerType(), ValueError),
+            (-(2**31), IntegerType(), None),
+            (2**31 - 1, IntegerType(), None),
+            (2**31, IntegerType(), ValueError),
+
+            # Long
+            (2**64, LongType(), None),
+
+            # Float & Double
+            (1.0, FloatType(), None),
+            (1, FloatType(), TypeError),
+            (1.0, DoubleType(), None),
+            (1, DoubleType(), TypeError),
+
+            # Decimal
+            (decimal.Decimal("1.0"), DecimalType(), None),
+            (1.0, DecimalType(), TypeError),
+            (1, DecimalType(), TypeError),
+            ("1.0", DecimalType(), TypeError),
+
+            # Binary
+            (bytearray([1, 2]), BinaryType(), None),
+            (1, BinaryType(), TypeError),
+
+            # Date/Time
+            (datetime.date(2000, 1, 2), DateType(), None),
+            (datetime.datetime(2000, 1, 2, 3, 4), DateType(), None),
+            ("2000-01-02", DateType(), TypeError),
+            (datetime.datetime(2000, 1, 2, 3, 4), TimestampType(), None),
+            (946811040, TimestampType(), TypeError),
+
+            # Array
+            ([], ArrayType(IntegerType()), None),
+            (["1", None], ArrayType(StringType(), containsNull=True), None),
+            ([1, 2], ArrayType(IntegerType()), None),
+            ([1, "2"], ArrayType(IntegerType()), TypeError),
+            ((1, 2), ArrayType(IntegerType()), None),
+            (array.array('h', [1, 2]), ArrayType(IntegerType()), None),
+
+            # Map
+            ({}, MapType(StringType(), IntegerType()), None),
+            ({"a": 1}, MapType(StringType(), IntegerType()), None),
+            ({"a": 1}, MapType(IntegerType(), IntegerType()), TypeError),
+            ({"a": "1"}, MapType(StringType(), IntegerType()), TypeError),
+            ({"a": None}, MapType(StringType(), IntegerType(), valueContainsNull=True), None),
+
+            # Struct
+            ({"s": "a", "i": 1}, MyStructType, None),
+            ({"s": "a",
```
[GitHub] spark pull request #17227: [SPARK-19507][PySpark][SQL] Show field name in _v...
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123100185
--- Diff: python/pyspark/sql/tests.py ---
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123099962
--- Diff: python/pyspark/sql/tests.py ---
+            ({"a": None}, MapType(StringType(), IntegerType(), valueContainsNull=True), None),
--- End diff --
Added
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123099761
--- Diff: python/pyspark/sql/tests.py ---
+            (["1", None], ArrayType(StringType(), containsNull=True), None),
--- End diff --
Added
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user dgingrich commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r123099453
--- Diff: python/pyspark/sql/tests.py ---
+            _verify_type(obj, data_type, nullable=True)
+            msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+            self.assertTrue(True, msg)
--- End diff --
Added
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r122282700
--- Diff: python/pyspark/sql/tests.py ---
+            ({"a": None}, MapType(StringType(), IntegerType(), valueContainsNull=True), None),
--- End diff --
I'd also like you to add a `valueContainsNull=False` case.
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r122283340
--- Diff: python/pyspark/sql/tests.py ---
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r122282508
--- Diff: python/pyspark/sql/tests.py ---
+            (["1", None], ArrayType(StringType(), containsNull=True), None),
--- End diff --
I'd like you to add a `containsNull=False` case too, with `None` in the list, to verify that it raises `ValueError` correctly.
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r122273265
--- Diff: python/pyspark/sql/tests.py ---
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/17227#discussion_r122287524
--- Diff: python/pyspark/sql/tests.py ---
+            _verify_type(obj, data_type, nullable=True)
+            msg = "_verify_type(%s, %s, nullable=True)" % (obj, data_type)
+            self.assertTrue(True, msg)
--- End diff --
I think we should surround `_verify_type(obj, data_type, nullable=True)` with a try block and check whether it raises an exception, the same as we do in the `test_verify_type_not_nullable` test.
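The pattern ueshin is asking for — wrap the call, print the traceback, and fail with a descriptive message — might look like this (a sketch with a toy verifier; the real test calls PySpark's `_verify_type` inside a `unittest.TestCase`):

```python
import traceback

def toy_verify(obj, data_type, nullable=True):
    # Stand-in for pyspark.sql.types._verify_type: only the nullability check.
    if obj is None and not nullable:
        raise ValueError("This field is not nullable, but got None")

failures = []
for obj, data_type in [(None, "IntegerType"), (None, "StringType")]:
    msg = "toy_verify(%s, %s, nullable=True)" % (obj, data_type)
    try:
        toy_verify(obj, data_type, nullable=True)
    except Exception:
        traceback.print_exc()
        failures.append(msg)  # in a TestCase this would be self.fail(msg)

assert failures == []  # every nullable case passed without raising
```

Unlike `self.assertTrue(True, msg)` — which can never fail — this actually turns an unexpected exception into a test failure.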
GitHub user dgingrich opened a pull request: https://github.com/apache/spark/pull/17227 [SPARK-19507][PySpark][SQL] Show field name in _verify_type error

## What changes were proposed in this pull request?

Add better error messages to _verify_type to track down which columns are not compliant with the schema.

## How was this patch tested?

Unit tests (incomplete), doctests, and hand inspection in the REPL.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dgingrich/spark topic-spark-19507-verify-types

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17227.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17227

commit ad5e5e5e5ed8396efca4c61eb0219fcd5a5e2caf
Author: David Gingrich
Date: 2017-02-28T08:05:00Z

    Remove "# noqa" comment from docstring

commit 5f72a547a948b5c5a787aace52df04bc8503888b
Author: David Gingrich
Date: 2017-02-28T08:09:59Z

    WIP: Add name parameter and better debugging to _verify_types

    * Add name parameter to _verify_types
    * Include name parameter in debug messages
    * Build name message for nested structs, arrays, and maps
    * Add detailed tests to flesh out spec for _verify_types (WIP)
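The core of the change is threading a field-name path through the recursive verifier so errors name the offending column. A toy illustration of that idea (hypothetical `verify` helper and schema encoding; not the actual patch to `_verify_type`):

```python
def verify(obj, schema, name="obj"):
    # schema is a toy encoding: either the string "str" or a dict of
    # {field_name: nested_schema}. The `name` parameter accumulates the
    # path so error messages point at the failing nested field.
    if isinstance(schema, dict):
        if not isinstance(obj, dict):
            raise TypeError("%s: expected dict, got %r" % (name, obj))
        for field, ftype in schema.items():
            verify(obj.get(field), ftype, name="%s.%s" % (name, field))
    elif schema == "str":
        if not isinstance(obj, str):
            raise TypeError("%s: expected str, got %r" % (name, obj))

try:
    verify({"a": {"b": 1}}, {"a": {"b": "str"}})
except TypeError as e:
    print(e)  # message starts with "obj.a.b"
```

Without the accumulated `name`, the error would only say a value had the wrong type, leaving the user to hunt through every column of a wide nested schema.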