[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18444 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128240401 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,93 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. Also, JVM only support signed types, when converting unsigned types, +# keep in mind that it required 1 more bit when stored as singed types. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html +# Reference for JVM's supported integral types: +# http://docs.oracle.com/javase/specs/jvms/se8/html/jvms-2.html#jvms-2.3.1 + +_array_signed_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'h': ctypes.c_short, +'i': ctypes.c_int, +'l': ctypes.c_long, +} + +_array_unsigned_int_typecode_ctype_mappings = { +'B': ctypes.c_ubyte, +'H': ctypes.c_ushort, +'I': ctypes.c_uint, +'L': ctypes.c_ulong +} + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. --- End diff -- nit: Scala type -> Catalyst datatype.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128174597 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,91 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. Also, JVM only support signed types, when converting unsigned types, +# keep in mind that it required 1 more bit when stored as singed types. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html +# Reference for JVM's supported integral types: +# http://docs.oracle.com/javase/specs/jvms/se8/html/jvms-2.html#jvms-2.3.1 + +_array_signed_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'h': ctypes.c_short, +'i': ctypes.c_int, +'l': ctypes.c_long, +} + +_array_unsigned_int_typecode_ctype_mappings = { +'B': ctypes.c_ubyte, +'H': ctypes.c_ushort, +'I': ctypes.c_uint, +'L': ctypes.c_ulong +} + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType + +# The list of all supported array typecodes is stored here +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not specified in C. +# On almost every system supported by both python and JVM, they are IEEE 754 +# single-precision binary floating-point format and IEEE 754 double-precision +# binary floating-point format. And we do assume the same thing here for now. 
+'f': FloatType, +'d': DoubleType +} + +# compute array typecode mappings for signed integer types +for _typecode in _array_signed_int_typecode_ctype_mappings.keys(): +size = ctypes.sizeof(_array_signed_int_typecode_ctype_mappings[_typecode]) * 8 +dt = _int_size_to_type(size) +if dt is not None: +_array_type_mappings[_typecode] = dt + +# compute array typecode mappings for unsigned integer types +for _typecode in _array_unsigned_int_typecode_ctype_mappings.keys(): +# JVM does not have unsigned types, so use signed types that is at list 1 --- End diff -- nit: typo `at list`.
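For reference, the size-based mapping quoted in the diff above can be sketched as a standalone snippet. This is a simplified re-creation, not the actual code in python/pyspark/sql/types.py: names are shortened and the Catalyst types are represented as plain strings.

```python
import ctypes

# Candidate C types behind Python array typecodes (simplified re-creation).
signed = {'b': ctypes.c_byte, 'h': ctypes.c_short,
          'i': ctypes.c_int, 'l': ctypes.c_long}
unsigned = {'B': ctypes.c_ubyte, 'H': ctypes.c_ushort,
            'I': ctypes.c_uint, 'L': ctypes.c_ulong}

def int_size_to_name(bits):
    # Smallest Catalyst integral type wide enough for `bits` bits,
    # or None when no JVM type can hold the value.
    for width, name in ((8, 'ByteType'), (16, 'ShortType'),
                        (32, 'IntegerType'), (64, 'LongType')):
        if bits <= width:
            return name
    return None

mappings = {}
for tc, ct in signed.items():
    mappings[tc] = int_size_to_name(ctypes.sizeof(ct) * 8)
for tc, ct in unsigned.items():
    # Unsigned values need one extra bit once stored in a signed JVM type.
    mappings[tc] = int_size_to_name(ctypes.sizeof(ct) * 8 + 1)

print(mappings['b'])  # ByteType: c_byte is always 8 bits
print(mappings['B'])  # ShortType: 8 unsigned bits need a 16-bit signed type
```

On an LP64 platform, `ctypes.sizeof(ctypes.c_ulong)` is 8, so `'L'` maps to `None`: no signed 64-bit JVM type can hold 65 bits, which is exactly the case debated below.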
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128142573 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Sounds good to me, too. Let's discuss it in the next pr.
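The array branch under review can be sketched outside Spark. The snippet below is a hypothetical stand-in for the real `_infer_type`: type names are plain strings, and `infer_array_type` is an illustrative helper, not a PySpark API.

```python
from array import array

# Simplified sketch of the array branch being reviewed: a known typecode
# maps to a concrete element type with containsNull=False (array.array
# cannot hold None), and an unknown typecode raises TypeError.
_array_type_mappings = {'d': 'DoubleType', 'i': 'IntegerType'}

def infer_array_type(obj):
    if obj.typecode in _array_type_mappings:
        return ('ArrayType', _array_type_mappings[obj.typecode], False)
    raise TypeError("not supported type: array(%s)" % obj.typecode)

print(infer_array_type(array('d', [1.0, 2.0])))
# -> ('ArrayType', 'DoubleType', False)
```

Note the contrast with the list branch, which keeps `containsNull=True` because a Python list can contain `None`.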
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128142511 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- https://issues.apache.org/jira/browse/SPARK-21465
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128141018 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- How about doing the following? 1. Make changes to the tests as you suggest, making sure we clearly document in comments why 'L' is an exception for Python 2. 2. Open a new JIRA explaining the issue about 'L' support. 3. Finish this PR.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128141384 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- OK. Let me do this. If possible, I would like to finish this PR before I travel.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128141295 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Sounds good to me.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128140334 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- > Should we add 'L' as an exception for python2 in the unsupported-types tests, or do we just completely remove the unsupported tests? I'd prefer keeping 'L' as an exception for Python 2. > Should we test 'L' for python 2? I don't think we should explicitly test this here. I guess we mistakenly happened to support this case, as you said. I think I (we contributors) can't make the decision (by rule) to remove this 'L' support ("undefined support status"), but I would like to push this PR forward, which is why I was thinking of deferring to a separate PR. cc @ueshin, are you online now? WDYT about removing the (unexpected, I guess) 'L' support in Python 2?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128138629 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- If we fall back to `_infer_type`, then some dirty changes to the test cases are needed to make them pass. Consider the following questions: 1. Should we add 'L' as an exception for python2 in the unsupported-types tests, or do we just completely remove the unsupported-types tests? 2. Should we test 'L' for python 2? I really like how the tests are organized now, and the changes above would make them very messy. My opinion is, we are not changing the status of 'L' from "supported" to "unsupported", but from "undefined support status" to "unsupported". If changing from "undefined support status" to "unsupported" sounds bad, then instead of making these changes to the test cases, I would rather solve this problem now and keep the test cases clean.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128135623 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- We could disallow or support this in a separate PR later. I would rather make a safe choice here right now.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128134574 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Wait... we could maybe go with decimal? Let's just fall back and leave this case as is. I am quite sure that would be safer.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128134072 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Okay. Either way is fine with me (falling back to type inference or disallowing unmappable types). Please let me know if any reviewer thinks differently.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128132778 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- @HyukjinKwon what do you think?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128132584 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Actually, I don't think 'L' should be supported, because there is no type in Scala that can hold a 64-bit unsigned integer...
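The point about 'L' can be checked numerically. This is a small illustration (not PySpark code) of why, on LP64 platforms where `array`'s 'L' typecode is backed by a 64-bit `unsigned long`, its values can exceed what the JVM's signed 64-bit `Long` can hold:

```python
import ctypes

# Width of the C unsigned long backing array typecode 'L' on this host:
# 64 bits on LP64 platforms (typical 64-bit Linux/macOS), 32 on LLP64.
bits = ctypes.sizeof(ctypes.c_ulong) * 8

unsigned_max = 2 ** bits - 1       # largest value an 'L' array can store
jvm_long_max = 2 ** 63 - 1         # java.lang.Long.MAX_VALUE

print(bits)
print(unsigned_max > jvm_long_max)  # True whenever bits >= 64
```

When `bits` is 64, an extra bit would be needed on the JVM side, and no signed integral type provides it; that is the "64-bits unsigned long" failure seen in the PickleException below.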
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128132120 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- @HyukjinKwon So, the conclusion is, we do support everything we supported before.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128131993 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Test result for python 2 is here:
```python
Python 2.7.13 (default, Jul 2 2017, 22:24:59)
[GCC 7.1.1 20170621] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/18 20:59:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/18 20:59:26 WARN Utils: Your hostname, archlinux resolves to a loopback address: 127.0.0.1; using 192.168.88.2 instead (on interface eno1)
17/07/18 20:59:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/07/18 20:59:29 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.13 (default, Jul 2 2017 22:24:59)
SparkSession available as 'spark'.
>>> import array,sys
>>> from pyspark import *
>>> from pyspark.sql import *
>>>
>>> def assertCollectSuccess(typecode, value):
...     row = Row(myarray=array.array(typecode, [value]))
...     df = spark.createDataFrame([row])
...     print(typecode)
...     df.show()
...
>>> assertCollectSuccess('u',u'a')
u
+---+
|myarray|
+---+
|[a]|
+---+
>>> assertCollectSuccess('f',1.2)
f
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('d',1.2)
d
+---+
|myarray|
+---+
| [1.2]|
+---+
>>> assertCollectSuccess('b',1)
b
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('B',1)
B
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('h',1)
h
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('H',1)
H
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('i',1)
i
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('I',1)
I
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('l',1)
l
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('L',1)
L
+---+
|myarray|
+---+
|[1]|
+---+
>>>
>>>
>>> if sys.version_info[0] > 2:
...     assertCollectSuccess('q',1)
...     assertCollectSuccess('Q',1)
...
>>>
>>> if sys.version_info[0] < 3:
...     assertCollectSuccess('c','a')
...
c
+---+
|myarray|
+---+
|[a]|
+---+
>>>
```
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128131947 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Definitely.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128131853 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Test result of python3 & spark 2.2.0 is here. I will check on python 2 in a few minutes. I paste the whole command line output, it's a bit lengthy. @HyukjinKwon could you please double check if my test is complete?
```python
>>> import array,sys
>>> from pyspark import *
>>> from pyspark.sql import *
>>>
>>> def assertCollectSuccess(typecode, value):
...     row = Row(myarray=array.array(typecode, [value]))
...     df = spark.createDataFrame([row])
...     print(typecode)
...     df.show()
...
>>> assertCollectSuccess('u',u'a')
u
+---+
|myarray|
+---+
|[a]|
+---+
>>> assertCollectSuccess('f',1.2)
f
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('d',1.2)
d
+---+
|myarray|
+---+
| [1.2]|
+---+
>>> assertCollectSuccess('b',1)
b
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('B',1)
B
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('h',1)
h
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('H',1)
H
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('i',1)
i
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('I',1)
I
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('l',1)
l
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('L',1)
L
17/07/18 20:55:55 ERROR Executor: Exception in task 14.0 in stage 119.0 (TID 799)
net.razorvine.pickle.PickleException: unsupported datatype: 64-bits unsigned long
	at net.razorvine.pickle.objects.ArrayConstructor.constructLongArrayFromUInt64(ArrayConstructor.java:302)
	at net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:240)
	at net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:26)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at
```
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128130893 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -55,13 +55,12 @@ private[spark] object SerDeUtil extends Logging { //{'d', sizeof(double), d_getitem, d_setitem}, //{'\0', 0, 0, 0} /* Sentinel */ // }; -// TODO: support Py_UNICODE with 2 bytes val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, + Map('B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, 'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21 ) } else { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, + Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, --- End diff -- Yes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128130330 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -55,13 +55,12 @@ private[spark] object SerDeUtil extends Logging { //{'d', sizeof(double), d_getitem, d_setitem}, //{'\0', 0, 0, 0} /* Sentinel */ // }; -// TODO: support Py_UNICODE with 2 bytes val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, + Map('B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, 'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21 ) } else { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, + Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, --- End diff -- Ah, yea. I meant to say `'c' -> 1` was never reachable, for sure.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128129716 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Yes, this is worth a double check. Let me run some tests on an old Spark.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128129350 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- (@zasdfgbnm I edited comments here)
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128129280 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- If we fall back to `_infer_type` here, then `array('q')` will be inferred as something like an array of Long, and the user will then see an error from `net.razorvine.pickle` saying bad array typecode `q`, due to `SPARK-21420`, which seems to me more confusing than failing explicitly here.
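For context, the fail-fast behavior being argued for above can be sketched as a small standalone example. This is illustrative only, not the PR's actual code: `infer_array_element_type` and `_supported_typecodes` are made-up names, and the supported set is an assumption based on the typecodes discussed in this thread ('q'/'Q' deliberately excluded pending SPARK-21420).

```python
import array

# Assumed set of supported typecodes, for illustration; 'q' and 'Q' are
# excluded until the JVM-side pickler can handle them (SPARK-21420).
_supported_typecodes = {'b', 'h', 'i', 'l', 'B', 'H', 'I', 'L', 'f', 'd', 'u'}


def infer_array_element_type(arr):
    """Fail fast on unsupported typecodes instead of deferring the error
    to the JVM-side unpickler, where the message would be more confusing."""
    if arr.typecode in _supported_typecodes:
        return arr.typecode
    raise TypeError("not supported type: array(%s)" % arr.typecode)


print(infer_array_element_type(array.array('i', [1, 2])))  # prints "i"
try:
    infer_array_element_type(array.array('q', [1]))
except TypeError as e:
    print(e)  # prints "not supported type: array(q)"
```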
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128128844 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Okay. To double-check, does it cover all the cases we supported before in Spark? If that can be confirmed, I am fine with it as is. I was trying to leave a sign-off for this reason - fixing a case we never supported before and a bug (assigning the correct type for `array.array`), with all other cases working as they did.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128128561 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -55,13 +55,12 @@ private[spark] object SerDeUtil extends Logging { //{'d', sizeof(double), d_getitem, d_setitem}, //{'\0', 0, 0, 0} /* Sentinel */ // }; -// TODO: support Py_UNICODE with 2 bytes val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, + Map('B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, 'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21 ) } else { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, + Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, --- End diff -- `c` was supported by spark 2.2.0:

```python
>>> row = Row(myarray=array('c', ["a"]))
>>> df = spark.createDataFrame([row])
>>> df.first()["myarray"][0]
u'a'
```

This support, I think, was because arrays were inferred the same way as lists. But after changing the type inference for arrays, we have to change this accordingly to keep it working.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128127304 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Is there any reason to do so? I don't think an array with an unsupported typecode will be correctly serialized or deserialized. In this case, it is better to raise a TypeError and let the user pick another type.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128124840 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -55,13 +55,12 @@ private[spark] object SerDeUtil extends Logging { //{'d', sizeof(double), d_getitem, d_setitem}, //{'\0', 0, 0, 0} /* Sentinel */ // }; -// TODO: support Py_UNICODE with 2 bytes val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, + Map('B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, 'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21 ) } else { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, + Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, --- End diff -- Last question to double-check. So, 'c' never worked in Python 2, but this PR fixes it?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128124801 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- How about falling back to type inference in this case?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127839034 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- Also, keys() makes a copy of the key list in Python 2, whereas the `in` operator against a dict directly calls `__contains__`, if I remember correctly. Let's leave this out in any event.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127837941 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- Yea. It should be O(1) vs O(n) IIRC. Thanks for testing out.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127814477 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- I don't think `i in my_dict` and `i in my_dict.keys()` have much difference in performance. A simple test

```python
from random import random,shuffle
from time import time

N = 1000
d = {}
for i in range(N):
    d[i] = random()
tests = list(range(2*N))
shuffle(tests)

time1 = time()
dummy = 0
for i in tests:
    if i in d:
        dummy += 1
time2 = time()
dummy = 0
for i in tests:
    if i in d.keys():
        dummy += 1
time3 = time()
print(time2-time1)
print(time3-time2)
```

gives

```
11.389298915863037
12.40167999268
```

on my MacBook Pro
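A note for readers: the benchmark above was run on Python 3, where `dict.keys()` returns a lightweight view object, so both forms are O(1) membership tests; the O(1)-vs-O(n) distinction the reviewers mention applies to Python 2, where `keys()` built a list. A minimal sketch of the Python 3 behavior (numbers and names here are illustrative, not from the thread):

```python
import timeit

# A dict of 1000 keys; membership via `in d` and `in d.keys()`.
d = {i: None for i in range(1000)}

# Both lookups hash the key and call __contains__; keys() is a view,
# not a list copy, so neither scans all entries on Python 3.
t_dict = timeit.timeit('500 in d', globals={'d': d}, number=100_000)
t_keys = timeit.timeit('500 in d.keys()', globals={'d': d}, number=100_000)
print(t_dict, t_keys)  # similar, tiny timings on Python 3

# The results are identical either way, for hits and misses alike.
assert (999 in d) == (999 in d.keys())
assert (1000 in d) == (1000 in d.keys())
```

On Python 2 the same comparison would show `in d.keys()` degrading linearly with dict size, which is why dropping `.keys()` was the safer style.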
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584948 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- Oh, I take it back. This is possibly a hot path.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584880 --- Diff: python/pyspark/sql/tests.py --- @@ -30,6 +30,10 @@ import functools import time import datetime +import array +import math --- End diff -- BTW, this import looks unused.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584800 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -57,11 +57,11 @@ private[spark] object SerDeUtil extends Logging { // }; // TODO: support Py_UNICODE with 2 bytes --- End diff -- Yea.. actually, this is the reason why I don't like leaving a TODO. Yes, it appears to be fixed.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584731 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. Also, JVM only support signed types, when converting unsigned types, +# keep in mind that it required 1 more bit when stored as singed types. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html +# Reference for JVM's supported integral types: +# http://docs.oracle.com/javase/specs/jvms/se8/html/jvms-2.html#jvms-2.3.1 + +_array_signed_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'h': ctypes.c_short, +'i': ctypes.c_int, +'l': ctypes.c_long, +} + +_array_unsigned_int_typecode_ctype_mappings = { +'B': ctypes.c_ubyte, +'H': ctypes.c_ushort, +'I': ctypes.c_uint, +'L': ctypes.c_ulong +} + +# TODO: [SPARK-21420] +# Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version_info[0] >= 3: +# _array_signed_int_typecode_ctype_mappings['q'] = ctypes.c_longlong +# _array_unsigned_int_typecode_ctype_mappings['Q'] = ctypes.c_ulonglong --- End diff -- Personally, I don't like leaving a TODO in code. I am pretty sure we have unresolved (or not-yet-removed, although already fixed) TODOs. At least, I cleared a few TODOs before. I guess `SPARK-21420` has a good description. Let's remove this, as the commit log and discussion preserve the code changes and context.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584576 --- Diff: python/pyspark/sql/tests.py --- @@ -30,6 +30,10 @@ import functools import time import datetime +import array +import math +import ctypes + --- End diff -- little nit: could we remove this extra newline?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584600 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- Could we explicitly do `_array_type_mappings.keys()` if you are fine with that? I find it confusing sometimes.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127555013 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -57,11 +57,11 @@ private[spark] object SerDeUtil extends Logging { // }; // TODO: support Py_UNICODE with 2 bytes --- End diff -- Since `array('u')` passes the tests, should this TODO be removed?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127521525 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version_info[0] >= 3: +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +# The list of all supported array typecodes is stored here +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not specified in C. +# On almost every system supported by both python and JVM, they are IEEE 754 +# single-precision binary floating-point format and IEEE 754 double-precision +# binary floating-point format. And we do assume the same thing here for now. 
+'f': FloatType, +'d': DoubleType } + +# compute array typecode mappings for integer types +for _typecode in _array_int_typecode_ctype_mappings.keys(): +size = ctypes.sizeof(_array_int_typecode_ctype_mappings[_typecode]) * 8 +if _typecode.isupper(): # 1 extra bit is required to store unsigned types --- End diff -- This is actually a good suggestion!
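The signed/unsigned sizing logic being reviewed above can be sketched as a standalone snippet. This is an approximation of the PR's `_int_size_to_type` loop, not the merged code: the Catalyst types are represented as strings here rather than real `pyspark.sql.types` classes, and the reported sizes come from the local C ABI (so `'l'`/`'L'` results vary by platform).

```python
import ctypes

def int_size_to_type_name(size_in_bits):
    # Smallest signed JVM/Catalyst integral type that can hold the value
    # losslessly; None if no 64-bit-or-smaller signed type fits (e.g. 65 bits).
    if size_in_bits <= 8:
        return 'ByteType'
    if size_in_bits <= 16:
        return 'ShortType'
    if size_in_bits <= 32:
        return 'IntegerType'
    if size_in_bits <= 64:
        return 'LongType'
    return None

signed = {'b': ctypes.c_byte, 'h': ctypes.c_short,
          'i': ctypes.c_int, 'l': ctypes.c_long}
unsigned = {'B': ctypes.c_ubyte, 'H': ctypes.c_ushort,
            'I': ctypes.c_uint, 'L': ctypes.c_ulong}

mappings = {}
for tc, ctype in signed.items():
    name = int_size_to_type_name(ctypes.sizeof(ctype) * 8)
    if name is not None:
        mappings[tc] = name
for tc, ctype in unsigned.items():
    # The JVM only has signed integral types, so an n-bit unsigned C type
    # needs n + 1 bits once stored as a signed value.
    name = int_size_to_type_name(ctypes.sizeof(ctype) * 8 + 1)
    if name is not None:
        mappings[tc] = name

print(mappings)
```

Note how this makes the "+1 bit" rule concrete: an 8-bit unsigned `'B'` needs 9 bits and so lands in `ShortType`, and a 64-bit unsigned `'L'` on an LP64 platform would need 65 bits and simply gets no mapping.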
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127513810 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle --- End diff -- I created a JIRA at https://issues.apache.org/jira/browse/SPARK-21420 and updated the comment here to reference it. I didn't remove this TODO because I think it might help with a future fix. Do you prefer to remove it?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127420412 --- Diff: python/pyspark/sql/tests.py --- @@ -2312,6 +2317,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): --- End diff -- Minor suggestions: ```diff @@ -2327,23 +2327,29 @@ class SQLTests(ReusedPySparkTestCase): # See: https://docs.python.org/2/library/array.html def assertCollectSuccess(typecode, value): -a = array.array(typecode, [value]) -row = Row(myarray=a) +row = Row(myarray=array.array(typecode, [value])) df = self.spark.createDataFrame([row]) -self.assertEqual(df.collect()[0]["myarray"][0], value) +self.assertEqual(df.first()["myarray"][0], value) -supported_types = [] - -# test string types +# supported string types +# +# blabla... "u" will be removed in python 4 blabla... +# and "c" not supported in python 3 blabla ... +supported_string_types = [] if sys.version_info[0] < 4: -supported_types += ['u'] +supported_string_types += ['u'] +# test unicode assertCollectSuccess('u', u"a") if sys.version_info[0] < 3: -supported_types += ['c'] +supported_string_types += ['c'] +# test string assertCollectSuccess('c', "a") +# supported float and double +# +# tests float max min blabla +supported_fractional_types = ['f', 'd'] # test float and double, assuming IEEE 754 floating-point format -supported_types += ['f', 'd'] assertCollectSuccess('f', ctypes.c_float(1e+38).value) assertCollectSuccess('f', ctypes.c_float(1e-38).value) assertCollectSuccess('f', ctypes.c_float(1.123456).value) @@ -2351,33 +2357,48 @@ class SQLTests(ReusedPySparkTestCase): assertCollectSuccess('d', sys.float_info.min) assertCollectSuccess('d', sys.float_info.epsilon) +# supported int types +# +# blabla .. only supported int types.. blabla.. 
+supported_int_types = list( +set(_array_int_typecode_ctype_mappings.keys()) +.intersection(set(_array_type_mappings.keys( # test int types -supported_int = list(set(_array_int_typecode_ctype_mappings.keys()). - intersection(set(_array_type_mappings.keys( -supported_types += supported_int -for i in supported_int: -ctype = _array_int_typecode_ctype_mappings[i] -if i.isupper(): # unsigned -assertCollectSuccess(i, 2 ** (ctypes.sizeof(ctype) * 8) - 1) -else: # signed +for t in supported_int_types: +ctype = _array_int_typecode_ctype_mappings[t] +if t.isupper(): +# test unsigned int types +assertCollectSuccess(t, 2 ** (ctypes.sizeof(ctype) * 8) - 1) +else: +# test signed int types max_val = 2 ** (ctypes.sizeof(ctype) * 8 - 1) -assertCollectSuccess(i, max_val - 1) -assertCollectSuccess(i, -max_val) - -# make sure that the test case cover all supported types +assertCollectSuccess(t, max_val - 1) +assertCollectSuccess(t, -max_val) + +# all supported types +# +# Make sure all the supported types are blabla ... +supported_types = (supported_string_types + + supported_fractional_types + + supported_int_types) +# test these are all supported types self.assertEqual(set(supported_types), set(_array_type_mappings.keys())) -# test unsupported types +# all unsupported types +# +# ... ... types are not supported in python 2/3 blabla. if sys.version_info[0] < 3: -all_type_codes = set(['c', 'b', 'B', 'u', 'h', 'H', 'i', 'I', 'l', 'L', 'f', 'd']) +all_types = set(['c', 'b', 'B', 'u', 'h', 'H', 'i', 'I', 'l', 'L', 'f', 'd']) else: -all_type_codes = set(array.typecodes) -unsupported_types = all_type_codes - set(supported_types) +all_types = set(array.typecodes) +unsupported_types = all_types -
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127421739 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- I don't think we should rely on exception handling here. Could we do something like the one below? ```python def _int_size_to_type(size): """ Return the Scala type from the size of integers. """ if size <= 8: return ByteType if size <= 16: return ShortType if size <= 32: return IntegerType if size <= 64: return LongType ... # compute array typecode mappings for integer types ... 
dt = _int_size_to_type(size) if dt is not None: _array_type_mappings[_typecode] = dt else: raise TypeError("not supported type: integer size too large.") ```
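The suggested control flow can be exercised standalone (a sketch, not the real pyspark module: type names are plain strings, the bit widths are hard-coded for illustration, and unmappable typecodes are skipped rather than raised, matching the "by design" behavior described later in the thread):

```python
def _int_size_to_type(size):
    # Return None (instead of raising) when no JVM integral type is wide
    # enough, so the caller decides how to handle an unmappable typecode.
    if size <= 8:
        return 'ByteType'
    if size <= 16:
        return 'ShortType'
    if size <= 32:
        return 'IntegerType'
    if size <= 64:
        return 'LongType'
    return None

array_type_mappings = {}
# bit widths: signed types need their native width, unsigned ones one extra bit
for typecode, nbits in [('b', 8), ('B', 9), ('l', 64), ('L', 65)]:
    dt = _int_size_to_type(nbits)
    if dt is not None:
        array_type_mappings[typecode] = dt
    # 'L' (needing 65 bits on LP64) is simply left out of the mapping
```

The advantage over the try/except version is that building the table at import time never relies on exception handling for an expected, non-error case.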
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127421891 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version_info[0] >= 3: +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +# The list of all supported array typecodes is stored here +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not specified in C. +# On almost every system supported by both python and JVM, they are IEEE 754 +# single-precision binary floating-point format and IEEE 754 double-precision +# binary floating-point format. And we do assume the same thing here for now. 
+'f': FloatType, +'d': DoubleType } + +# compute array typecode mappings for integer types +for _typecode in _array_int_typecode_ctype_mappings.keys(): +size = ctypes.sizeof(_array_int_typecode_ctype_mappings[_typecode]) * 8 +if _typecode.isupper(): # 1 extra bit is required to store unsigned types --- End diff -- I would rather make a separate list for unsigned types than add an if here and in the test (not a strong opinion).
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126072461 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle --- End diff -- Could we open a JIRA with the suggested comments below and remove this todo and comments here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126072745 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version_info[0] >= 3: +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +# The list of all supported array typecodes is stored here +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not specified in C. +# On almost every system supported by both python and JVM, they are IEEE 754 +# single-precision binary floating-point format and IEEE 754 double-precision +# binary floating-point format. And we do assume the same thing here for now. 
+'f': FloatType, +'d': DoubleType } + +# compute array typecode mappings for integer types +for _typecode in _array_int_typecode_ctype_mappings.keys(): +size = ctypes.sizeof(_array_int_typecode_ctype_mappings[_typecode]) * 8 +if _typecode.isupper(): # 1 extra bit is required to store unsigned types +size += 1 +try: +_array_type_mappings.update({ +_typecode: _int_size_to_type(size) +}) --- End diff -- How about `_array_type_mappings[_typecode] = _int_size_to_type(size)`? (and for the other similar instances here)
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126809517 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -72,7 +72,11 @@ private[spark] object SerDeUtil extends Logging { val typecode = args(0).asInstanceOf[String].charAt(0) // This must be ISO 8859-1 / Latin 1, not UTF-8, to interoperate correctly val data = args(1).asInstanceOf[String].getBytes(StandardCharsets.ISO_8859_1) --- End diff -- Can anyone explain why `ISO_8859_1` is used here instead of UTF16 or UTF32?
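One plausible answer, sketched below: ISO-8859-1 maps all 256 byte values one-to-one onto code points, so arbitrary binary data survives a decode/encode round trip, which UTF-8 (and the multi-byte units of UTF-16/32) cannot guarantee. This only illustrates the property; it is not the actual pickle wire path.

```python
# Every possible byte value round-trips through Latin-1 unchanged.
raw = bytes(range(256))
assert raw.decode('iso-8859-1').encode('iso-8859-1') == raw

# UTF-8, by contrast, rejects many byte sequences outright.
try:
    b'\xff\xfe'.decode('utf-8')
    utf8_accepts = True
except UnicodeDecodeError:
    utf8_accepts = False
assert not utf8_accepts
```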
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126806942 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -72,7 +72,11 @@ private[spark] object SerDeUtil extends Logging { val typecode = args(0).asInstanceOf[String].charAt(0) // This must be ISO 8859-1 / Latin 1, not UTF-8, to interoperate correctly val data = args(1).asInstanceOf[String].getBytes(StandardCharsets.ISO_8859_1) -construct(typecode, machineCodes(typecode), data) +val machine_code = machineCodes(typecode) +// fix data alignment +val unit_length = if (machine_code==18 || machine_code==19) 2 else 4 +val aligned_data = data ++ Array.fill[Byte](unit_length - data.length % unit_length)(0) --- End diff -- Not done yet. I think this will only work on little endian.
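The byte-order concern can be illustrated directly (a standalone sketch, not the Scala code path: it just shows that the same raw bytes decode to different integers depending on endianness, which is why blindly appending zero bytes is only safe under one byte order):

```python
import struct

payload = b'\x01\x00\x00\x00'  # four raw bytes, as they might arrive from the pickler

little = struct.unpack('<i', payload)[0]  # read as little-endian int32
big = struct.unpack('>i', payload)[0]     # read as big-endian int32

assert little == 1
assert big == 2 ** 24  # same bytes, different value: 16777216
```

Zero bytes appended at the end are trailing (insignificant) digits only in the little-endian reading; in the big-endian reading they shift the significant bytes, corrupting the value.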
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125910692 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +1043,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- Yes, you are right ```python >>> from array import * >>> for i in set(typecodes): ... try: ... array(i,[None]) ... print(i,'success') ... except: ... print(i,'fail') ... L fail I fail b fail B fail i fail H fail q fail Q fail f fail l fail u fail h fail d fail ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125855962 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +1043,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- Btw, can `array` have `None` value? If not, we can use `False` for `containsNull` of `ArrayType`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125642042 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173526 --- Diff: python/pyspark/sql/tests.py --- @@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html + +def assertCollectSuccess(typecode, value): +a = array.array(typecode, [value]) +row = Row(myarray=a) +df = self.spark.createDataFrame([row]) +self.assertEqual(df.collect()[0]["myarray"][0], value) + +supported_types = [] + +# test string types +if sys.version < "4": --- End diff -- Yes, this looks better I think. I will change that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173517 --- Diff: python/pyspark/sql/tests.py --- @@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html + +def assertCollectSuccess(typecode, value): +a = array.array(typecode, [value]) +row = Row(myarray=a) +df = self.spark.createDataFrame([row]) +self.assertEqual(df.collect()[0]["myarray"][0], value) + +supported_types = [] + +# test string types +if sys.version < "4": +supported_types += ['u'] +assertCollectSuccess('u', "a") +if sys.version < "3": +supported_types += ['c'] +assertCollectSuccess('c', "a") + +# test float and double, assuming IEEE 754 floating-point format +supported_types += ['f', 'd'] +assertCollectSuccess('f', ctypes.c_float(1e+38).value) +assertCollectSuccess('f', ctypes.c_float(1e-38).value) +assertCollectSuccess('f', ctypes.c_float(1.123456).value) +assertCollectSuccess('d', ctypes.c_double(1e+308).value) +assertCollectSuccess('d', ctypes.c_double(1e+308).value) +assertCollectSuccess('d', ctypes.c_double(1.123456789012345).value) + +# test int types +supported_int = list(set(_array_int_typecode_ctype_mappings.keys()). + intersection(set(_array_type_mappings.keys( +supported_types += supported_int +for i in supported_int: +ctype = _array_int_typecode_ctype_mappings[i] +if i.isupper(): --- End diff -- Yes, good idea. I will change that. 
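The signed/unsigned bounds used in the test above can be derived directly from the ctypes sizes (a sketch; `c_byte`/`c_ubyte` are one byte on every platform, while the wider types are platform dependent):

```python
import ctypes

def int_bounds(typecode, ctype):
    """Inclusive (min, max) range of the C type behind an array typecode."""
    bits = ctypes.sizeof(ctype) * 8
    if typecode.isupper():                        # unsigned typecode
        return 0, 2 ** bits - 1
    return -2 ** (bits - 1), 2 ** (bits - 1) - 1  # signed, two's complement

assert int_bounds('b', ctypes.c_byte) == (-128, 127)
assert int_bounds('B', ctypes.c_ubyte) == (0, 255)
```

Testing exactly these endpoints is what catches an off-by-one in the "unsigned needs one extra bit" logic.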
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173508 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- But I think I would add a simple line of comment on this to make the code more readable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173496 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- I don't think we should log this. This is just a helper function that helps to construct `_array_type_mappings`, which is a complete list of all supported type codes. Being filtered out here is not an error, it's by design. If users try to use unsupported typecode, they will see another `TypeError` due to line 1052: ```python . 
elif isinstance(obj, array): if obj.typecode in _array_type_mappings: return ArrayType(_array_type_mappings[obj.typecode](), True) else: raise TypeError("not supported type: array(%s)" % obj.typecode) . ```
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173449 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not unspecified. --- End diff -- Yes, you are correct. Thanks for figuring this out. I just checked, `sys.float_info` is the info for C type double, not for C type float: ```python >>> import sys >>> sys.float_info.max 1.7976931348623157e+308 >>> sys.float_info.dig 15 ``` So we can not use this to check range for C float. But this is not the main reason I'm not using it. 
The main reason is: although C does not specify that IEEE-754 floating point types must be used, every C platform I have ever seen uses IEEE-754. (There is also a StackOverflow question about this: https://stackoverflow.com/questions/31967040/is-it-safe-to-assume-floating-point-is-represented-using-ieee754-floats-in-c ) I don't even know if there exists a platform in the world where Python is supported, the JVM is supported, and the C floating point types do not use IEEE-754. So I think it is OK to assume these types are IEEE-754 for now to keep the code cleaner; it is not worth the effort to support something that might not even exist. But I'm not an expert on this either, so if you know someone who might know more about it, please ping them to double check. On the other hand, if users on such a platform ever find that this assumption is wrong, they can report a new bug to fix it. But yes, my comment is confusing and I will try to make it clearer. Also, thank you for pointing out `sys.float_info`; although I don't think I need it here, it will be very useful in the test cases, and I will change part of them to use it to make the code more readable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
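An editorial sketch to make the point above concrete (this code is illustrative, not part of the PR's diff): `sys.float_info` describes the C `double` that backs Python's `float`, while a round-trip through `ctypes.c_float` reveals what a C `float` (the 'f' array typecode) can actually hold.

```python
import sys
import ctypes

# sys.float_info describes the C double type backing Python's float,
# so its max (~1.8e308) is far beyond what a C float can represent.
print(sys.float_info.max)

# Round-tripping through ctypes.c_float shows the C float's limits:
# values beyond roughly 3.4e38 overflow to infinity...
print(ctypes.c_float(1e39).value)

# ...while in-range values survive, possibly with reduced precision.
print(ctypes.c_float(1e38).value)
```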
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125172166 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- So in this case we are silently filtering out integer types from supported conversions - do we expect this to happen? Should we log this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
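As an editorial aside on the dispatch being reviewed above, the idea can be sketched standalone (the names here are illustrative, not the PR's exact code): pick the smallest signed Catalyst-style integral type wide enough for the given bit count, remembering that an unsigned C type needs one extra bit when it is stored as a signed JVM type.

```python
import ctypes

def int_size_to_type(size):
    """Map a bit width to the smallest signed Catalyst-style integral type name."""
    if size <= 8:
        return "ByteType"
    if size <= 16:
        return "ShortType"
    if size <= 32:
        return "IntegerType"
    if size <= 64:
        return "LongType"
    raise TypeError("not supported type: integer size too large.")

# A signed 16-bit C short fits in ShortType directly...
print(int_size_to_type(ctypes.sizeof(ctypes.c_short) * 8))
# ...but an unsigned 16-bit value needs one extra bit once stored as a
# signed JVM type, so it is promoted to the next wider type.
print(int_size_to_type(ctypes.sizeof(ctypes.c_ushort) * 8 + 1))
```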
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125171058 --- Diff: python/pyspark/sql/tests.py --- @@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html + +def assertCollectSuccess(typecode, value): +a = array.array(typecode, [value]) +row = Row(myarray=a) +df = self.spark.createDataFrame([row]) +self.assertEqual(df.collect()[0]["myarray"][0], value) + +supported_types = [] + +# test string types +if sys.version < "4": +supported_types += ['u'] +assertCollectSuccess('u', "a") +if sys.version < "3": +supported_types += ['c'] +assertCollectSuccess('c', "a") + +# test float and double, assuming IEEE 754 floating-point format +supported_types += ['f', 'd'] +assertCollectSuccess('f', ctypes.c_float(1e+38).value) +assertCollectSuccess('f', ctypes.c_float(1e-38).value) +assertCollectSuccess('f', ctypes.c_float(1.123456).value) +assertCollectSuccess('d', ctypes.c_double(1e+308).value) +assertCollectSuccess('d', ctypes.c_double(1e+308).value) +assertCollectSuccess('d', ctypes.c_double(1.123456789012345).value) + +# test int types +supported_int = list(set(_array_int_typecode_ctype_mappings.keys()). 
+ intersection(set(_array_type_mappings.keys( +supported_types += supported_int +for i in supported_int: +ctype = _array_int_typecode_ctype_mappings[i] +if i.isupper(): --- End diff -- In the code that makes the mapping you have a comment about isupper being unsigned, for Scala SQL devs who maybe have to debug this in the future, I'd duplicate this comment here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
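As the reviewer suggests, the signed/unsigned convention is worth restating wherever it is relied on. A small illustrative sketch (not the PR's code): in the `array` module an uppercase typecode is the unsigned counterpart of its lowercase letter, and each pair shares an item size.

```python
import array

# Uppercase typecodes are unsigned, lowercase are signed; each pair has
# the same item size (e.g. 'h' signed short vs 'H' unsigned short).
for code in ['b', 'B', 'h', 'H', 'i', 'I', 'l', 'L']:
    kind = "unsigned" if code.isupper() else "signed"
    print(code, kind, array.array(code).itemsize)

# An unsigned array accepts values a signed array of the same size rejects:
print(array.array('H', [65535])[0])
```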
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125171044 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not unspecified. --- End diff -- So I think you mean to say "is not specified" On the other hand, is there a reason we couldn't use sys.float_info to determine the ranges on this? I have not done a lot of poking with float arrays in Python though so just a question. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125170990 --- Diff: python/pyspark/sql/tests.py --- @@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html + +def assertCollectSuccess(typecode, value): +a = array.array(typecode, [value]) +row = Row(myarray=a) +df = self.spark.createDataFrame([row]) +self.assertEqual(df.collect()[0]["myarray"][0], value) + +supported_types = [] + +# test string types +if sys.version < "4": --- End diff -- consider maybe using sys.version_info? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
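The suggestion above matters because version strings compare character by character, not numerically. A minimal sketch of why `sys.version_info` is the robust choice:

```python
import sys

# String comparison is lexicographic, so a hypothetical Python "10.0.0"
# would wrongly satisfy sys.version < "4" (because "1" < "4").
print("10.0.0" < "4")

# sys.version_info is a tuple of ints and compares numerically.
if sys.version_info < (3,):
    print("Python 2: the 'c' array typecode exists")
else:
    print("Python 3: the 'c' typecode was removed; 'u' remains")
```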
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124676549 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +968,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- I meant https://github.com/apache/spark/commit/67c75021c59d93cda9b5d70c0ef6d547fff92083. I think it is okay. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124572922 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +968,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- @HyukjinKwon Could you send me the link to "the Pandas type related PR we helped review before"? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124536696 --- Diff: python/pyspark/sql/tests.py --- @@ -2259,6 +2261,69 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html --- End diff -- @ueshin The answer to your question is explained here, a couple lines of comments I just added. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124531036 --- Diff: python/pyspark/sql/tests.py --- @@ -2259,6 +2261,60 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +int_types = set(['b', 'h', 'i', 'l']) +float_types = set(['f', 'd']) +unsupported_types = set(array.typecodes) - int_types - float_types + +def collected(a): +row = Row(myarray=a) +rdd = self.sc.parallelize([row]) +df = self.spark.createDataFrame(rdd) +return df.collect()[0]["myarray"][0] +# test whether pyspark can correctly handle int types +for t in int_types: +# test positive numbers +a = array.array(t, [1]) +while True: --- End diff -- Let me figure that out. This patch was more than a year ago and I forget what was happening. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124489677 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +968,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- Thanks for cc'ing me. If we say `float` being `double` is a bug, I think this is the same instance of the Pandas type related PRs we helped review before. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124464895 --- Diff: python/pyspark/sql/tests.py --- @@ -2259,6 +2261,60 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +int_types = set(['b', 'h', 'i', 'l']) +float_types = set(['f', 'd']) +unsupported_types = set(array.typecodes) - int_types - float_types + +def collected(a): +row = Row(myarray=a) +rdd = self.sc.parallelize([row]) +df = self.spark.createDataFrame(rdd) +return df.collect()[0]["myarray"][0] +# test whether pyspark can correctly handle int types +for t in int_types: +# test positive numbers +a = array.array(t, [1]) +while True: --- End diff -- What is this loop for? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
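For context on the loop in question: it repeatedly grows a value until the array type overflows, probing the platform-specific range at runtime. An alternative editorial sketch (not the PR's code) computes the range directly from the C type's size:

```python
import ctypes

def signed_range(ctype):
    """Representable range of a signed C integer type, from its byte size."""
    bits = ctypes.sizeof(ctype) * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def unsigned_max(ctype):
    """Largest value of an unsigned C integer type."""
    return 2 ** (ctypes.sizeof(ctype) * 8) - 1

# On typical platforms short is 16 bits, so this prints (-32768, 32767);
# the exact figures for 'i'/'l' depend on the C implementation.
print(signed_range(ctypes.c_short))
print(unsigned_max(ctypes.c_ushort))
```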
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124465362 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +968,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- I'm not sure if we can change the element type. Doesn't this break backward-compatibility? cc @HyukjinKwon --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
GitHub user zasdfgbnm opened a pull request: https://github.com/apache/spark/pull/18444 [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating DataFrame using python

## What changes were proposed in this pull request?

This is the reopen of https://github.com/apache/spark/pull/14198, with merge conflicts resolved. @ueshin Could you please take a look at my code?

This fixes bugs about types that result in an array of nulls when creating a DataFrame using Python. Python's `array.array` has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. The code in spark-sql and pyspark didn't take this into consideration, which can produce an array of null values when a row contains an `array('f')`. A simple way to reproduce this bug is:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row, DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```

which has output

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?
New test case added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zasdfgbnm/spark fix_array_infer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18444.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18444

commit a127486d59528eae452dcbcc2ccfb68fdd7769b7
Author: Xiang Gao
Date: 2016-07-09T00:58:14Z

    use array.typecode to infer type

    Python's array has more types than Python itself; for example, Python only has float while array supports 'f' (float) and 'd' (double). Switching to array.typecode helps Spark make a better inference. For example, for the code:

        from pyspark.sql.types import _infer_type
        from array import array
        a = array('f', [1, 2, 3, 4, 5, 6])
        _infer_type(a)

    we will get ArrayType(DoubleType,true) before the change, but ArrayType(FloatType,true) after the change.

commit 70131f3b81575edf9073d5be72553730d6316bd6
Author: Xiang Gao
Date: 2016-07-09T06:21:31Z

    Merge branch 'master' into fix_array_infer

commit 505e819f415c2f754b5147908516ace6f6ddfe78
Author: Xiang Gao
Date: 2016-07-13T12:53:18Z

    sync with upstream

commit 05979ca6eabf723cf3849ec2bf6f6e9de26cb138
Author: Xiang Gao
Date: 2016-07-14T08:07:12Z

    add case (c: Float, FloatType) to fromJava

commit 5cd817a4e7ec68a693ee2a878a2e36b09b1965b6
Author: Xiang Gao
Date: 2016-07-14T08:09:25Z

    sync with upstream

commit cd2ec6bc707fb6e7255b3a6a6822c3667866c63c
Author: Xiang Gao
Date: 2016-10-17T02:44:48Z

    add test for array in dataframe

commit 527d969067e447f8bff6004570c27130346cdf76
Author: Xiang Gao
Date: 2016-10-17T03:13:47Z

    merge with upstream/master

commit 82223c02082793b899c7eeca70f7bbfcea516c28
Author: Xiang Gao
Date: 2016-10-17T03:35:47Z

    set unsigned types and Py_UNICODE as unsupported

commit 0a967e280b3250bf7217e61905ad28f010c4ed40
Author: Xiang Gao
Date: 2016-10-17T17:46:35Z

    fix code style

commit 2059435b45ed1f6337a4f935adcd029084cfec91
Author: Xiang Gao
Date: 2016-10-18T00:11:05Z

    fix the same problem for byte and short

commit 58b120c4d207d9332e6dcde20109651ad8e17190
Author: Xiang Gao
Date: 2017-06-28T01:28:03Z

    sync with upstream