[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444

fixed

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444

We just change the dirty hack to something like:

```
if python 2.7 and not pypy:
    do the dirty hack
    fool the unit test
```
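The gate described in the pseudocode above could be written as follows; this is a hedged sketch, not Spark's actual code, and the name `apply_workaround` is illustrative:

```python
import platform
import sys

# Apply the workaround only on CPython 2.7, never on PyPy
# (hypothetical condition matching the pseudocode above).
is_python_27 = sys.version_info[:2] == (2, 7)
is_pypy = platform.python_implementation() == "PyPy"
apply_workaround = is_python_27 and not is_pypy

print(apply_workaround)
```

On Python 3 or on PyPy this evaluates to `False`, so the hack and the test that papers over it are both skipped.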
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444

@ueshin Maybe we don't have to do the same thing. If this is the problem, then I think 'L' was never supported in the PyPy environment in the first place, and there is no need to support it now.
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444

Hmm, how was 'L' supported on master with Python 2? Why was there no such error there?
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444

Looks to be an unrelated error.
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444

@HyukjinKwon Take a look at my newest commit. I think I found a better way to solve the problem: it keeps all the hacking code for `SPARK-21465` in a single place, making it easier to remove in the future. What do you think?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128142511

```diff
--- Diff: python/pyspark/sql/types.py ---
@@ -938,12 +1016,17 @@ def _infer_type(obj):
                 return MapType(_infer_type(key), _infer_type(value), True)
         else:
             return MapType(NullType(), NullType(), True)
-    elif isinstance(obj, (list, array)):
+    elif isinstance(obj, list):
         for v in obj:
             if v is not None:
                 return ArrayType(_infer_type(obj[0]), True)
         else:
             return ArrayType(NullType(), True)
+    elif isinstance(obj, array):
+        if obj.typecode in _array_type_mappings:
+            return ArrayType(_array_type_mappings[obj.typecode](), False)
+        else:
+            raise TypeError("not supported type: array(%s)" % obj.typecode)
```

--- End diff --

https://issues.apache.org/jira/browse/SPARK-21465
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128141018

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

How about doing the following?

1. Make the changes to the tests as you suggest, and make sure we clearly document in comments why 'L' is an exception for Python 2.
2. Open a new JIRA explaining the issue about 'L' support.
3. Finish this PR.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128141384

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

OK. Let me do this. If possible, I would like to finish this PR before I travel.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128138629

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

If we fall back to `_infer_type`, then there would have to be some dirty changes to the test cases to make them pass. Consider the following questions:

1. Should we add 'L' as an exception for Python 2 in the unsupported-types tests, or should we remove the unsupported-types tests completely?
2. Should we test 'L' for Python 2?

I really like how the tests are organized now, and the changes above would make them very messy. My opinion is that we are not changing the status of 'L' from "supported" to "unsupported", but from "undefined support status" to "unsupported". If changing from "undefined support status" to "unsupported" sounds bad, then instead of making these changes to the test cases I would rather solve this problem now and keep the test cases clean.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128132778

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

@HyukjinKwon what do you think?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128132584

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

Actually I don't think 'L' should be supported, because there is no type in Scala that can hold a 64-bit unsigned integer...
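The size mismatch behind this point can be checked directly with `ctypes`; this is a quick sketch assuming a platform where C `unsigned long` (the storage type behind `array('L')`) is 64 bits wide, as on typical 64-bit Linux:

```python
import ctypes

# Width of C unsigned long on this platform.
ulong_bits = ctypes.sizeof(ctypes.c_ulong) * 8
ulong_max = 2 ** ulong_bits - 1

# The largest value a Scala/Java Long (signed 64-bit) can hold.
jvm_long_max = 2 ** 63 - 1

print("unsigned long bits:", ulong_bits)
print("fits in a JVM Long:", ulong_max <= jvm_long_max)
```

When `ulong_bits` is 64, the second line prints `False`: roughly half the representable `array('L')` values would overflow a JVM `Long`, which is why no lossless mapping exists.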
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128132120

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

@HyukjinKwon So the conclusion is: we still support everything we supported before.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128131993

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

Test result for Python 2 is here:

```python
Python 2.7.13 (default, Jul 2 2017, 22:24:59)
[GCC 7.1.1 20170621] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/18 20:59:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/18 20:59:26 WARN Utils: Your hostname, archlinux resolves to a loopback address: 127.0.0.1; using 192.168.88.2 instead (on interface eno1)
17/07/18 20:59:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/07/18 20:59:29 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.13 (default, Jul 2 2017 22:24:59)
SparkSession available as 'spark'.
>>> import array, sys
>>> from pyspark import *
>>> from pyspark.sql import *
>>>
>>> def assertCollectSuccess(typecode, value):
...     row = Row(myarray=array.array(typecode, [value]))
...     df = spark.createDataFrame([row])
...     print(typecode)
...     df.show()
...
>>> assertCollectSuccess('u', u'a')
u
+-------+
|myarray|
+-------+
|    [a]|
+-------+

>>> assertCollectSuccess('f', 1.2)
f
+-------+
|myarray|
+-------+
| [null]|
+-------+

>>> assertCollectSuccess('d', 1.2)
d
+-------+
|myarray|
+-------+
|  [1.2]|
+-------+

>>> assertCollectSuccess('b', 1)
b
+-------+
|myarray|
+-------+
| [null]|
+-------+

>>> assertCollectSuccess('B', 1)
B
+-------+
|myarray|
+-------+
| [null]|
+-------+

>>> assertCollectSuccess('h', 1)
h
+-------+
|myarray|
+-------+
| [null]|
+-------+

>>> assertCollectSuccess('H', 1)
H
+-------+
|myarray|
+-------+
|    [1]|
+-------+

>>> assertCollectSuccess('i', 1)
i
+-------+
|myarray|
+-------+
|    [1]|
+-------+

>>> assertCollectSuccess('I', 1)
I
+-------+
|myarray|
+-------+
|    [1]|
+-------+

>>> assertCollectSuccess('l', 1)
l
+-------+
|myarray|
+-------+
|    [1]|
+-------+

>>> assertCollectSuccess('L', 1)
L
+-------+
|myarray|
+-------+
|    [1]|
+-------+

>>> if sys.version_info[0] > 2:
...     assertCollectSuccess('q', 1)
...     assertCollectSuccess('Q', 1)
...
>>> if sys.version_info[0] < 3:
...     assertCollectSuccess('c', 'a')
...
c
+-------+
|myarray|
+-------+
|    [a]|
+-------+

>>>
```
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128131853

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

Test result of Python 3 & Spark 2.2.0 is here. I will check on Python 2 in a few minutes. I pasted the whole command-line output, so it is a bit lengthy. @HyukjinKwon could you please double check whether my test is complete?

```python
>>> import array, sys
>>> from pyspark import *
>>> from pyspark.sql import *
>>>
>>> def assertCollectSuccess(typecode, value):
...     row = Row(myarray=array.array(typecode, [value]))
...     df = spark.createDataFrame([row])
...     print(typecode)
...     df.show()
...
>>> assertCollectSuccess('u', u'a')
u
+-------+
|myarray|
+-------+
|    [a]|
+-------+

>>> assertCollectSuccess('f', 1.2)
f
+-------+
|myarray|
+-------+
| [null]|
+-------+

>>> assertCollectSuccess('d', 1.2)
d
+-------+
|myarray|
+-------+
|  [1.2]|
+-------+

>>> assertCollectSuccess('b', 1)
b
+-------+
|myarray|
+-------+
| [null]|
+-------+

>>> assertCollectSuccess('B', 1)
B
+-------+
|myarray|
+-------+
| [null]|
+-------+

>>> assertCollectSuccess('h', 1)
h
+-------+
|myarray|
+-------+
| [null]|
+-------+

>>> assertCollectSuccess('H', 1)
H
+-------+
|myarray|
+-------+
|    [1]|
+-------+

>>> assertCollectSuccess('i', 1)
i
+-------+
|myarray|
+-------+
|    [1]|
+-------+

>>> assertCollectSuccess('I', 1)
I
+-------+
|myarray|
+-------+
|    [1]|
+-------+

>>> assertCollectSuccess('l', 1)
l
+-------+
|myarray|
+-------+
|    [1]|
+-------+

>>> assertCollectSuccess('L', 1)
L
17/07/18 20:55:55 ERROR Executor: Exception in task 14.0 in stage 119.0 (TID 799)
net.razorvine.pickle.PickleException: unsupported datatype: 64-bits unsigned long
    at net.razorvine.pickle.objects.ArrayConstructor.constructLongArrayFromUInt64(ArrayConstructor.java:302)
    at net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:240)
    at net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:26)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
    at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152)
    at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.s
```
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128130893

```diff
--- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala ---
@@ -55,13 +55,12 @@ private[spark] object SerDeUtil extends Logging {
     //    {'d', sizeof(double), d_getitem, d_setitem},
     //    {'\0', 0, 0, 0} /* Sentinel */
     // };
-    // TODO: support Py_UNICODE with 2 bytes
     val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) {
-      Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9,
+      Map('B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9,
         'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21
       )
     } else {
-      Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8,
+      Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8,
```

--- End diff --

Yes
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444

By the way, I'm traveling tomorrow and will be back next Tuesday. During traveling, I may not be able to answer any comments, questions, etc.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128129716

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

Yes, this is worth a double check. Let me do some tests on old Spark.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128129280

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

If we fall back to `_infer_type` here, then `array('q')` will be inferred as something like an array of Long, and the user will then see an error from `net.razorvine.pickle` saying bad array typecode `q`, due to `SPARK-21420`. That seems more confusing to me than failing explicitly here.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128128561

--- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala (same hunk as quoted above) --- End diff --

`c` was supported by Spark 2.2.0:

```python
>>> row = Row(myarray=array('c', ["a"]))
>>> df = spark.createDataFrame([row])
>>> df.first()["myarray"][0]
u'a'
```

I think this support existed because arrays were inferred the same way as lists. Now that we change how the types of arrays are inferred, we have to change this accordingly to keep it working.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128127304

--- Diff: python/pyspark/sql/types.py (same hunk as quoted above) --- End diff --

Is there any reason to do so? I don't think an array with an unsupported typecode will be correctly serialized or deserialized. In this case, it is better to raise a TypeError and let the user pick another type.
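The fail-fast behavior argued for here can be sketched in isolation; `check_array` and `_supported_typecodes` below are hypothetical names, and the set of codes is illustrative rather than Spark's actual mapping:

```python
from array import array

# Illustrative set of typecodes assumed to have a lossless Spark SQL mapping.
_supported_typecodes = {'b', 'B', 'h', 'H', 'i', 'I', 'l', 'f', 'd', 'u'}

def check_array(arr):
    """Raise early for typecodes that could not round-trip correctly."""
    if arr.typecode not in _supported_typecodes:
        raise TypeError("not supported type: array(%s)" % arr.typecode)
    return arr

check_array(array('i', [1, 2, 3]))   # accepted
try:
    check_array(array('q', [1]))     # rejected with a clear Python-side message
except TypeError as e:
    print(e)
```

Raising on the driver gives the user an actionable message instead of a `net.razorvine.pickle` stack trace from an executor.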
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444

@HyukjinKwon all done
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127814477

```diff
--- Diff: python/pyspark/sql/types.py ---
@@ -938,12 +1023,17 @@ def _infer_type(obj):
+    elif isinstance(obj, array):
+        if obj.typecode in _array_type_mappings:
```

--- End diff --

I don't think `i in my_dict` and `i in my_dict.keys()` differ much in performance. A simple test

```python
from random import random, shuffle
from time import time

N = 1000
d = {}
for i in range(N):
    d[i] = random()
tests = list(range(2 * N))
shuffle(tests)

time1 = time()
dummy = 0
for i in tests:
    if i in d:
        dummy += 1
time2 = time()
dummy = 0
for i in tests:
    if i in d.keys():
        dummy += 1
time3 = time()

print(time2 - time1)
print(time3 - time2)
```

gives

```
11.389298915863037
12.40167999268
```

on my MacBook Pro
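A more repeatable way to measure the same thing is `timeit`; this sketch assumes CPython 3, where `dict.keys()` returns a lightweight view, so both spellings are O(1) per lookup (on Python 2, `keys()` builds a list and the second variant would be far slower):

```python
import timeit

d = {i: None for i in range(1000)}
tests = list(range(2000))

# Each call performs 2000 membership checks; timeit repeats it 200 times.
t_plain = timeit.timeit(lambda: [i in d for i in tests], number=200)
t_keys = timeit.timeit(lambda: [i in d.keys() for i in tests], number=200)

# The .keys() variant only adds the cost of creating the view once per call.
print("i in d:        %.4fs" % t_plain)
print("i in d.keys(): %.4fs" % t_keys)
```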
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127555013 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -57,11 +57,11 @@ private[spark] object SerDeUtil extends Logging { // }; // TODO: support Py_UNICODE with 2 bytes --- End diff -- Since `array('u')` passes the tests, should this TODO be removed?
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 I updated my code according to @HyukjinKwon's suggestion.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127521525 --- Diff: python/pyspark/sql/types.py ---
```
@@ -915,6 +916,90 @@ def _parse_datatype_json_value(json_value):
         long: LongType,
     })
 
+# Mapping Python array types to Spark SQL DataType
+# We should be careful here. The size of these types in python depends on C
+# implementation. We need to make sure that this conversion does not lose any
+# precision.
+#
+# Reference for C integer size, see:
+# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types.
+# Reference for python array typecode, see:
+# https://docs.python.org/2/library/array.html
+# https://docs.python.org/3.6/library/array.html
+
+_array_int_typecode_ctype_mappings = {
+    'b': ctypes.c_byte,
+    'B': ctypes.c_ubyte,
+    'h': ctypes.c_short,
+    'H': ctypes.c_ushort,
+    'i': ctypes.c_int,
+    'I': ctypes.c_uint,
+    'l': ctypes.c_long,
+    'L': ctypes.c_ulong
+}
+
+# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle
+# Type code 'q' and 'Q' are not available at python 2
+# if sys.version_info[0] >= 3:
+#     _array_int_typecode_ctype_mappings.update({
+#         'q': ctypes.c_longlong,
+#         'Q': ctypes.c_ulonglong
+#     })
+
+
+def _int_size_to_type(size):
+    """
+    Return the Scala type from the size of integers.
+    """
+    if size <= 8:
+        return ByteType
+    if size <= 16:
+        return ShortType
+    if size <= 32:
+        return IntegerType
+    if size <= 64:
+        return LongType
+    raise TypeError("not supported type: integer size too large.")
+
+# The list of all supported array typecodes is stored here
+_array_type_mappings = {
+    # Warning: Actual properties for float and double in C are not specified.
+    # On almost every system supported by both python and JVM, they are IEEE 754
+    # single-precision binary floating-point format and IEEE 754 double-precision
+    # binary floating-point format. And we do assume the same thing here for now.
+    'f': FloatType,
+    'd': DoubleType
+}
+
+# compute array typecode mappings for integer types
+for _typecode in _array_int_typecode_ctype_mappings.keys():
+    size = ctypes.sizeof(_array_int_typecode_ctype_mappings[_typecode]) * 8
+    if _typecode.isupper():  # 1 extra bit is required to store unsigned types
```
--- End diff -- This is actually a good suggestion!
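The computed half of the mapping quoted above can be reproduced standalone. The following is a minimal sketch of the idea only: plain strings stand in for the real Spark SQL type classes, and `int_typecode_ctypes`/`int_size_to_type` are local stand-in names, not the PR's actual identifiers:

```python
import ctypes

# Typecode -> ctypes mapping for C integer types, as in the quoted diff.
int_typecode_ctypes = {
    'b': ctypes.c_byte, 'B': ctypes.c_ubyte,
    'h': ctypes.c_short, 'H': ctypes.c_ushort,
    'i': ctypes.c_int, 'I': ctypes.c_uint,
    'l': ctypes.c_long, 'L': ctypes.c_ulong,
}

def int_size_to_type(size):
    # Smallest Spark SQL integral type that can hold `size` bits, or None.
    for bits, name in [(8, 'ByteType'), (16, 'ShortType'),
                       (32, 'IntegerType'), (64, 'LongType')]:
        if size <= bits:
            return name
    return None  # too large: this typecode stays unsupported

mappings = {}
for typecode, ctype in int_typecode_ctypes.items():
    size = ctypes.sizeof(ctype) * 8
    if typecode.isupper():
        size += 1  # unsigned: one extra bit to hold the value in a signed type
    name = int_size_to_type(size)
    if name is not None:
        mappings[typecode] = name
```

On an LP64 platform, `'L'` (unsigned long, 65 bits needed) drops out of the mapping, which matches the `'L'`-unsupported behavior discussed elsewhere in this thread.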
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127513810 --- Diff: python/pyspark/sql/types.py ---
```
@@ -935,6 +936,90 @@ def _parse_datatype_json_value(json_value):
         long: LongType,
     })
 
+# Mapping Python array types to Spark SQL DataType
+# We should be careful here. The size of these types in python depends on C
+# implementation. We need to make sure that this conversion does not lose any
+# precision.
+#
+# Reference for C integer size, see:
+# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types.
+# Reference for python array typecode, see:
+# https://docs.python.org/2/library/array.html
+# https://docs.python.org/3.6/library/array.html
+
+_array_int_typecode_ctype_mappings = {
+    'b': ctypes.c_byte,
+    'B': ctypes.c_ubyte,
+    'h': ctypes.c_short,
+    'H': ctypes.c_ushort,
+    'i': ctypes.c_int,
+    'I': ctypes.c_uint,
+    'l': ctypes.c_long,
+    'L': ctypes.c_ulong
+}
+
+# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle
```
--- End diff -- I created a JIRA at https://issues.apache.org/jira/browse/SPARK-21420 and updated the comment here to reference it. I didn't remove this TODO because I think it might help with a future fix. Would you prefer to remove it?
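Incidentally, whether `'q'`/`'Q'` exist in a given interpreter can be probed directly from the `array` module (a standalone check, not the PR's code):

```python
import sys
from array import typecodes

# 'q'/'Q' (signed/unsigned long long) were added to the array module
# in Python 3.3 and do not exist on Python 2.
has_longlong = 'q' in typecodes and 'Q' in typecodes
print(has_longlong)  # True on Python 3.3+
```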
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 Finally, all tests pass... I'm happy that my test case helped find two more bugs, and thanks to @ueshin for fixing them :)
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126809517 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala ---
```
@@ -72,7 +72,11 @@ private[spark] object SerDeUtil extends Logging {
       val typecode = args(0).asInstanceOf[String].charAt(0)
       // This must be ISO 8859-1 / Latin 1, not UTF-8, to interoperate correctly
       val data = args(1).asInstanceOf[String].getBytes(StandardCharsets.ISO_8859_1)
```
--- End diff -- Can anyone explain why `ISO_8859_1` is used here instead of UTF-16 or UTF-32?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126806942 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala ---
```
@@ -72,7 +72,11 @@ private[spark] object SerDeUtil extends Logging {
       val typecode = args(0).asInstanceOf[String].charAt(0)
       // This must be ISO 8859-1 / Latin 1, not UTF-8, to interoperate correctly
       val data = args(1).asInstanceOf[String].getBytes(StandardCharsets.ISO_8859_1)
-      construct(typecode, machineCodes(typecode), data)
+      val machine_code = machineCodes(typecode)
+      // fix data alignment
+      val unit_length = if (machine_code == 18 || machine_code == 19) 2 else 4
+      val aligned_data = data ++ Array.fill[Byte](unit_length - data.length % unit_length)(0)
```
--- End diff -- Not done yet. I think this will only work on little-endian machines.
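In Python terms, the intended alignment step looks roughly like the sketch below (`zero_pad` is a hypothetical helper, not code from the PR). Note two caveats grounded in the quoted Scala: the expression there appears to append a full extra unit even when the data is already aligned, and, as the comment above says, padding at the tail only extends values correctly for little-endian layouts:

```python
def zero_pad(data: bytes, unit: int) -> bytes:
    """Zero-pad `data` so its length is a multiple of `unit` bytes.

    Unlike the quoted Scala expression, this is a no-op when the data
    is already aligned. Trailing zero bytes only fill the high-order
    bytes of the last element on little-endian layouts.
    """
    return data + b"\x00" * (-len(data) % unit)
```

For example, `zero_pad(b"abc", 2)` yields `b"abc\x00"`, while `zero_pad(b"abcd", 4)` returns the input unchanged.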
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 For some reason, I cannot reproduce the error on my machine. I ran the test with the following command:
```bash
PYSPARK_PYTHON=$(which python2) ./bin/spark-submit python/pyspark/sql/tests.py SQLTests.test_array_types 2>/dev/null
```
and it always passes... So I have to commit and push to let Jenkins run the test and see whether it passes...
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444
```
net.razorvine.pickle.PickleException: for c/u type must be 18/19/20/21
	at net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:154)
	at org.apache.spark.api.python.SerDeUtil$ArrayConstructor.construct(SerDeUtil.scala:75)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:162)
```
strange
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125910692 --- Diff: python/pyspark/sql/types.py ---
```
@@ -958,12 +1043,17 @@ def _infer_type(obj):
                 return MapType(_infer_type(key), _infer_type(value), True)
         else:
             return MapType(NullType(), NullType(), True)
-    elif isinstance(obj, (list, array)):
+    elif isinstance(obj, list):
         for v in obj:
             if v is not None:
                 return ArrayType(_infer_type(obj[0]), True)
         else:
             return ArrayType(NullType(), True)
+    elif isinstance(obj, array):
+        if obj.typecode in _array_type_mappings:
+            return ArrayType(_array_type_mappings[obj.typecode](), True)
```
--- End diff -- Yes, you are right
```python
>>> from array import *
>>> for i in set(typecodes):
...     try:
...         array(i, [None])
...         print(i, 'success')
...     except:
...         print(i, 'fail')
...
L fail
I fail
b fail
B fail
i fail
H fail
q fail
Q fail
f fail
l fail
u fail
h fail
d fail
```
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 Maybe the bug is in py4j
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 There seems to be another bug somewhere in Spark:
```python
rdd = sc.parallelize([array('l', [9223372036854775807])])
print(rdd._to_java_object_rdd().collect()[0][0])
```
The above code gives -1 on Python 2.
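A value silently turning into `-1` is a classic symptom of a width or signedness mismatch when bytes are reinterpreted. As a generic illustration of the effect (not the actual Spark/py4j code path):

```python
import struct

v = 9223372036854775807            # 2**63 - 1, max signed 64-bit value
packed = struct.pack('<q', v)      # little-endian signed 64-bit
assert struct.unpack('<q', packed)[0] == v  # correct round trip

# Reading the same 8 bytes as two signed 32-bit ints instead:
lo, hi = struct.unpack('<ii', packed)
print(lo, hi)  # -1 2147483647 -- the low word alone reads back as -1
```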
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 It is strange... The test passes on Python 3, but on Python 2 with typecode `l`, `9223372036854775807` becomes `-1` after creating a `DataFrame`.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125642042 --- Diff: python/pyspark/sql/types.py ---
```
@@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value):
         long: LongType,
     })
 
+# Mapping Python array types to Spark SQL DataType
+# We should be careful here. The size of these types in python depends on C
+# implementation. We need to make sure that this conversion does not lose any
+# precision.
+#
+# Reference for C integer size, see:
+# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types .
+# Reference for python array typecode, see:
+# https://docs.python.org/2/library/array.html
+# https://docs.python.org/3.6/library/array.html
+
+_array_int_typecode_ctype_mappings = {
+    'b': ctypes.c_byte,
+    'B': ctypes.c_ubyte,
+    'h': ctypes.c_short,
+    'H': ctypes.c_ushort,
+    'i': ctypes.c_int,
+    'I': ctypes.c_uint,
+    'l': ctypes.c_long,
+    'L': ctypes.c_ulong
+}
+
+# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle
+# Type code 'q' and 'Q' are not available at python 2
+# if sys.version > "2":
+#     _array_int_typecode_ctype_mappings.update({
+#         'q': ctypes.c_longlong,
+#         'Q': ctypes.c_ulonglong
+#     })
+
+
+def _int_size_to_type(size):
+    """
+    Return the Scala type from the size of integers.
+    """
+    if size <= 8:
+        return ByteType
+    if size <= 16:
+        return ShortType
+    if size <= 32:
+        return IntegerType
+    if size <= 64:
+        return LongType
+    raise TypeError("not supported type: integer size too large.")
```
--- End diff -- fixed
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 The test fails due to a `§` in a comment...
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 retest this please
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 I'm not sure if Jenkins would listen to me :(
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 Jenkins, retest this please
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173526 --- Diff: python/pyspark/sql/tests.py ---
```
@@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()
 
+    # test for SPARK-16542
+    def test_array_types(self):
+        # This test needs to make sure that the Scala type selected is at least
+        # as large as the python's types. This is necessary because python's
+        # array types depend on C implementation on the machine. Therefore there
+        # is no machine independent correspondence between python's array types
+        # and Scala types.
+        # See: https://docs.python.org/2/library/array.html
+
+        def assertCollectSuccess(typecode, value):
+            a = array.array(typecode, [value])
+            row = Row(myarray=a)
+            df = self.spark.createDataFrame([row])
+            self.assertEqual(df.collect()[0]["myarray"][0], value)
+
+        supported_types = []
+
+        # test string types
+        if sys.version < "4":
```
--- End diff -- Yes, I think this looks better. I will change that.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173517 --- Diff: python/pyspark/sql/tests.py ---
```
@@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()
 
+    # test for SPARK-16542
+    def test_array_types(self):
+        # This test needs to make sure that the Scala type selected is at least
+        # as large as the python's types. This is necessary because python's
+        # array types depend on C implementation on the machine. Therefore there
+        # is no machine independent correspondence between python's array types
+        # and Scala types.
+        # See: https://docs.python.org/2/library/array.html
+
+        def assertCollectSuccess(typecode, value):
+            a = array.array(typecode, [value])
+            row = Row(myarray=a)
+            df = self.spark.createDataFrame([row])
+            self.assertEqual(df.collect()[0]["myarray"][0], value)
+
+        supported_types = []
+
+        # test string types
+        if sys.version < "4":
+            supported_types += ['u']
+            assertCollectSuccess('u', "a")
+        if sys.version < "3":
+            supported_types += ['c']
+            assertCollectSuccess('c', "a")
+
+        # test float and double, assuming IEEE 754 floating-point format
+        supported_types += ['f', 'd']
+        assertCollectSuccess('f', ctypes.c_float(1e+38).value)
+        assertCollectSuccess('f', ctypes.c_float(1e-38).value)
+        assertCollectSuccess('f', ctypes.c_float(1.123456).value)
+        assertCollectSuccess('d', ctypes.c_double(1e+308).value)
+        assertCollectSuccess('d', ctypes.c_double(1e+308).value)
+        assertCollectSuccess('d', ctypes.c_double(1.123456789012345).value)
+
+        # test int types
+        supported_int = list(set(_array_int_typecode_ctype_mappings.keys()).
+                             intersection(set(_array_type_mappings.keys())))
+        supported_types += supported_int
+        for i in supported_int:
+            ctype = _array_int_typecode_ctype_mappings[i]
+            if i.isupper():
```
--- End diff -- Yes, good idea. I will change that.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173508 --- Diff: python/pyspark/sql/types.py ---
```
@@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value):
         long: LongType,
     })
 
+# Mapping Python array types to Spark SQL DataType
+# We should be careful here. The size of these types in python depends on C
+# implementation. We need to make sure that this conversion does not lose any
+# precision.
+#
+# Reference for C integer size, see:
+# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types .
+# Reference for python array typecode, see:
+# https://docs.python.org/2/library/array.html
+# https://docs.python.org/3.6/library/array.html
+
+_array_int_typecode_ctype_mappings = {
+    'b': ctypes.c_byte,
+    'B': ctypes.c_ubyte,
+    'h': ctypes.c_short,
+    'H': ctypes.c_ushort,
+    'i': ctypes.c_int,
+    'I': ctypes.c_uint,
+    'l': ctypes.c_long,
+    'L': ctypes.c_ulong
+}
+
+# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle
+# Type code 'q' and 'Q' are not available at python 2
+# if sys.version > "2":
+#     _array_int_typecode_ctype_mappings.update({
+#         'q': ctypes.c_longlong,
+#         'Q': ctypes.c_ulonglong
+#     })
+
+
+def _int_size_to_type(size):
+    """
+    Return the Scala type from the size of integers.
+    """
+    if size <= 8:
+        return ByteType
+    if size <= 16:
+        return ShortType
+    if size <= 32:
+        return IntegerType
+    if size <= 64:
+        return LongType
+    raise TypeError("not supported type: integer size too large.")
```
--- End diff -- But I think I will add a simple comment here to make the code more readable.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173496 --- Diff: python/pyspark/sql/types.py ---
```
@@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value):
         long: LongType,
     })
 
+# Mapping Python array types to Spark SQL DataType
+# We should be careful here. The size of these types in python depends on C
+# implementation. We need to make sure that this conversion does not lose any
+# precision.
+#
+# Reference for C integer size, see:
+# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types .
+# Reference for python array typecode, see:
+# https://docs.python.org/2/library/array.html
+# https://docs.python.org/3.6/library/array.html
+
+_array_int_typecode_ctype_mappings = {
+    'b': ctypes.c_byte,
+    'B': ctypes.c_ubyte,
+    'h': ctypes.c_short,
+    'H': ctypes.c_ushort,
+    'i': ctypes.c_int,
+    'I': ctypes.c_uint,
+    'l': ctypes.c_long,
+    'L': ctypes.c_ulong
+}
+
+# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle
+# Type code 'q' and 'Q' are not available at python 2
+# if sys.version > "2":
+#     _array_int_typecode_ctype_mappings.update({
+#         'q': ctypes.c_longlong,
+#         'Q': ctypes.c_ulonglong
+#     })
+
+
+def _int_size_to_type(size):
+    """
+    Return the Scala type from the size of integers.
+    """
+    if size <= 8:
+        return ByteType
+    if size <= 16:
+        return ShortType
+    if size <= 32:
+        return IntegerType
+    if size <= 64:
+        return LongType
+    raise TypeError("not supported type: integer size too large.")
```
--- End diff -- I don't think we should log this. This is just a helper function that helps to construct `_array_type_mappings`, which is a complete list of all supported type codes. Being filtered out here is not an error, it's by design. If users try to use an unsupported typecode, they will see another `TypeError` due to line 1052:
```python
elif isinstance(obj, array):
    if obj.typecode in _array_type_mappings:
        return ArrayType(_array_type_mappings[obj.typecode](), True)
    else:
        raise TypeError("not supported type: array(%s)" % obj.typecode)
```
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173449 --- Diff: python/pyspark/sql/types.py ---
```
@@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value):
         long: LongType,
     })
 
+# Mapping Python array types to Spark SQL DataType
+# We should be careful here. The size of these types in python depends on C
+# implementation. We need to make sure that this conversion does not lose any
+# precision.
+#
+# Reference for C integer size, see:
+# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types .
+# Reference for python array typecode, see:
+# https://docs.python.org/2/library/array.html
+# https://docs.python.org/3.6/library/array.html
+
+_array_int_typecode_ctype_mappings = {
+    'b': ctypes.c_byte,
+    'B': ctypes.c_ubyte,
+    'h': ctypes.c_short,
+    'H': ctypes.c_ushort,
+    'i': ctypes.c_int,
+    'I': ctypes.c_uint,
+    'l': ctypes.c_long,
+    'L': ctypes.c_ulong
+}
+
+# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle
+# Type code 'q' and 'Q' are not available at python 2
+# if sys.version > "2":
+#     _array_int_typecode_ctype_mappings.update({
+#         'q': ctypes.c_longlong,
+#         'Q': ctypes.c_ulonglong
+#     })
+
+
+def _int_size_to_type(size):
+    """
+    Return the Scala type from the size of integers.
+    """
+    if size <= 8:
+        return ByteType
+    if size <= 16:
+        return ShortType
+    if size <= 32:
+        return IntegerType
+    if size <= 64:
+        return LongType
+    raise TypeError("not supported type: integer size too large.")
+
+_array_type_mappings = {
+    # Warning: Actual properties for float and double in C is not unspecified.
```
--- End diff -- Yes, you are correct. Thanks for figuring this out. I just checked: `sys.float_info` is the info for the C type double, not for the C type float:
```python
>>> import sys
>>> sys.float_info.max
1.7976931348623157e+308
>>> sys.float_info.dig
15
```
So we cannot use this to check the range of C float. But this is not the main reason I'm not using it. The main reason is: although C does not specify that we have to use IEEE-754 floating point types, every C platform I have ever seen uses IEEE-754. (There is also a StackOverflow question about this: https://stackoverflow.com/questions/31967040/is-it-safe-to-assume-floating-point-is-represented-using-ieee754-floats-in-c ) I don't even know whether there exists a platform in the world where python is supported, the JVM is supported, and the floating point types in C do not use IEEE-754. So I think it is OK to assume that these types are IEEE-754 for now to make the code cleaner. It is not worth any effort to support something that might not even exist. But I'm not an expert on this either, so if you know someone who might know more about this, please ping them to double check. On the other hand, if there do exist users on such a platform who find that this is a wrong assumption, they can report a new bug to fix it. But yes, my comment seems to be confusing and I will see if I can make it clearer. Also, thank you for pointing out `sys.float_info`; although I don't think I need to use it here, it would be very useful in test cases. I will change part of my test cases to use it to make the code more readable.
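As a follow-up on `sys.float_info`: it does allow a runtime sanity check of the IEEE 754 binary64 assumption for C `double` (a quick standalone check; there is no analogous introspection for C `float`):

```python
import sys

# IEEE 754 binary64 has a 53-bit significand, a maximum binary exponent
# of 1024, and radix 2.
is_ieee754_double = (sys.float_info.mant_dig == 53
                     and sys.float_info.max_exp == 1024
                     and sys.float_info.radix == 2)
print(is_ieee754_double)
```

On every platform where CPython and the JVM both run, this is expected to print `True`, which is exactly the assumption the quoted comment makes.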
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 jenkins retest this please
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 @ueshin @HyukjinKwon I think I'm done now. There are still test failures, but they don't look related to my change.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124572922

--- Diff: python/pyspark/sql/types.py ---
@@ -958,12 +968,17 @@ def _infer_type(obj):
         return MapType(_infer_type(key), _infer_type(value), True)
     else:
         return MapType(NullType(), NullType(), True)
-    elif isinstance(obj, (list, array)):
+    elif isinstance(obj, list):
         for v in obj:
             if v is not None:
                 return ArrayType(_infer_type(obj[0]), True)
         else:
             return ArrayType(NullType(), True)
+    elif isinstance(obj, array):
+        if obj.typecode in _array_type_mappings:
+            return ArrayType(_array_type_mappings[obj.typecode](), True)
--- End diff --

@HyukjinKwon Could you send me the link to "the Pandas type related PR we helped review before"?
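The inference branch in the diff above can be sketched in isolation. This is not the real pyspark code: the mapping below uses plain strings for the type names so the sketch runs standalone, while the actual `_array_type_mappings` maps typecodes to `pyspark.sql.types` classes.

```python
from array import array

# Stand-in for pyspark's _array_type_mappings: the real entries map typecodes
# to DataType classes; strings are used here so the sketch is self-contained.
_array_type_mappings = {'b': 'ByteType', 'h': 'ShortType', 'i': 'IntegerType',
                        'l': 'LongType', 'f': 'FloatType', 'd': 'DoubleType'}

def infer_array_type(obj):
    # An array.array carries its element type in obj.typecode, so inference
    # never needs to inspect the elements (unlike a plain list, which falls
    # back to inferring from its first non-None element).
    if isinstance(obj, array):
        if obj.typecode in _array_type_mappings:
            return 'ArrayType(%s)' % _array_type_mappings[obj.typecode]
        raise TypeError("not supported type: array(%r)" % obj.typecode)
    raise TypeError("expected an array.array")

print(infer_array_type(array('f', [1, 2, 3])))  # ArrayType(FloatType)
print(infer_array_type(array('d', [1, 2, 3])))  # ArrayType(DoubleType)
```

Note how `array('f', ...)` and `array('d', ...)` now infer different types, which is the bug this PR fixes.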
[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/18444 @HyukjinKwon Thanks for figuring that out. I will fix those issues.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124536696

--- Diff: python/pyspark/sql/tests.py ---
@@ -2259,6 +2261,69 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()

+    # test for SPARK-16542
+    def test_array_types(self):
+        # This test need to make sure that the Scala type selected is at least
+        # as large as the python's types. This is necessary because python's
+        # array types depend on C implementation on the machine. Therefore there
+        # is no machine independent correspondence between python's array types
+        # and Scala types.
+        # See: https://docs.python.org/2/library/array.html
--- End diff --

@ueshin The answer to your question is in the couple of comment lines I just added here.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124531036

--- Diff: python/pyspark/sql/tests.py ---
@@ -2259,6 +2261,60 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()

+    # test for SPARK-16542
+    def test_array_types(self):
+        int_types = set(['b', 'h', 'i', 'l'])
+        float_types = set(['f', 'd'])
+        unsupported_types = set(array.typecodes) - int_types - float_types
+
+        def collected(a):
+            row = Row(myarray=a)
+            rdd = self.sc.parallelize([row])
+            df = self.spark.createDataFrame(rdd)
+            return df.collect()[0]["myarray"][0]
+        # test whether pyspark can correctly handle int types
+        for t in int_types:
+            # test positive numbers
+            a = array.array(t, [1])
+            while True:
--- End diff --

Let me figure that out. This patch is more than a year old and I've forgotten what was happening.
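The `while True:` loop under discussion probes the platform-dependent range of a typecode at runtime: `array.array` raises `OverflowError` once a value no longer fits the underlying C type, so doubling until that happens finds the largest storable power of two. A small standalone sketch of that probing idea (not the actual test code):

```python
from array import array

def largest_power_of_two(typecode):
    # Double a candidate until array.array refuses to store it; the last
    # accepted value is the largest power of two the C type can hold.
    v = 1
    while True:
        try:
            array(typecode, [v * 2])
        except OverflowError:
            return v
        v *= 2

# 'b' is a signed char: maximum value 127, so the largest storable power
# of two is 64. 'h' is a signed short (16 bits): maximum 32767, so 16384.
print(largest_power_of_two('b'))  # 64
print(largest_power_of_two('h'))  # 16384
```

This is why the test can stay machine-independent: it discovers the actual C sizes instead of hard-coding them.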
[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/14198 reopened at https://github.com/apache/spark/pull/18444
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
GitHub user zasdfgbnm opened a pull request: https://github.com/apache/spark/pull/18444 [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating DataFrame using python

## What changes were proposed in this pull request?

This is the reopen of https://github.com/apache/spark/pull/14198, with merge conflicts resolved. @ueshin Could you please take a look at my code?

Fix bugs about types that result in an array of null when creating a DataFrame using Python. Python's array.array has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. The code in spark-sql and pyspark didn't take this into consideration, which could give you an array of null values when your rows contain an `array('f')`. A simple way to reproduce this bug is:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row, DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```

which outputs

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?
New test case added You can merge this pull request into a Git repository by running: $ git pull https://github.com/zasdfgbnm/spark fix_array_infer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18444.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18444 commit a127486d59528eae452dcbcc2ccfb68fdd7769b7 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-07-09T00:58:14Z use array.typecode to infer type Python's array has more type than python it self, for example python only has float while array support 'f' (float) and 'd' (double) Switching to array.typecode helps spark make a better inference For example, for the code: from pyspark.sql.types import _infer_type from array import array a = array('f',[1,2,3,4,5,6]) _infer_type(a) We will get ArrayType(DoubleType,true) before change, but ArrayType(FloatType,true) after change commit 70131f3b81575edf9073d5be72553730d6316bd6 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-07-09T06:21:31Z Merge branch 'master' into fix_array_infer commit 505e819f415c2f754b5147908516ace6f6ddfe78 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-07-13T12:53:18Z sync with upstream commit 05979ca6eabf723cf3849ec2bf6f6e9de26cb138 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-07-14T08:07:12Z add case (c: Float, FloatType) to fromJava commit 5cd817a4e7ec68a693ee2a878a2e36b09b1965b6 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-07-14T08:09:25Z sync with upstream commit cd2ec6bc707fb6e7255b3a6a6822c3667866c63c Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-10-17T02:44:48Z add test for array in dataframe commit 527d969067e447f8bff6004570c27130346cdf76 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-10-17T03:13:47Z merge with upstream/master commit 82223c02082793b899c7eeca70f7bbfcea516c28 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-10-17T03:35:47Z 
set unsigned types and Py_UNICODE as unsupported commit 0a967e280b3250bf7217e61905ad28f010c4ed40 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-10-17T17:46:35Z fix code style commit 2059435b45ed1f6337a4f935adcd029084cfec91 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2016-10-18T00:11:05Z fix the same problem for byte and short commit 58b120c4d207d9332e6dcde20109651ad8e17190 Author: Xiang Gao <qasdfgtyu...@gmail.com> Date: 2017-06-28T01:28:03Z sync with upstream
[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/14198 @ueshin @gatorsmile I'm happy to resolve the conflicts IF AND ONLY IF a developer will work on the code review for this. This PR was opened more than a year ago and I have been waiting for review for a year. If it is guaranteed that a reviewer will be assigned to this soon, I will resolve the conflicts. Otherwise, I don't want to maintain a PR forever just to wait for review.
[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/14198 Hi @holdenk , I think I'm done. I created a test for this issue, and it showed that Spark has the same problem not only for float but also for byte and short. After several commits, `./python/run-tests --modules=pyspark-sql` passes on my computer. To be clear, only arrays with typecodes `b,h,i,l,f,d` are supported: arrays with typecode `u` are not supported because it "corresponds to Python's obsolete unicode character", arrays with typecodes `B,H,I,L` are not supported because there are no unsigned types on the JVM, and arrays with typecodes `q,Q` are not supported because they "are available only if the platform C compiler used to build Python supports C long long", which makes supporting them complicated. For the unsupported typecodes, a TypeError is raised if the user tries to create a DataFrame from them. Would you, or any other developer, review my code and get it merged?
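The supported/unsupported split described above can be computed directly from `array.typecodes`. A sketch of that policy (the exact set of available typecodes varies slightly by Python version, e.g. whether `'q'`/`'Q'` are present):

```python
import array

# Typecodes the patch maps to Spark types: signed integers plus floats.
supported = {'b', 'h', 'i', 'l', 'f', 'd'}

# Everything else the platform offers is rejected with a TypeError:
# 'u' (obsolete unicode character), unsigned 'B','H','I','L' (no unsigned
# types on the JVM), and 'q','Q' where present (C long long availability
# depends on the compiler used to build Python).
unsupported = set(array.typecodes) - supported

print(sorted(supported))
print(sorted(unsupported))
```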
[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/14198 Something to mention: there is still one problem that I'm not sure I solved correctly. Python's array supports unsigned types, but the JVM does not. The solution in this PR is to convert unsigned types to a larger type, e.g. unsigned int -> long. I'm not sure whether it would be better to reject the unsigned types in Python and throw an exception.
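The widening idea described above (e.g. unsigned int -> long) can be sketched with `ctypes`: an N-bit unsigned value needs N+1 bits of signed range, so it is promoted to the next Spark integral type. This is an illustrative sketch, not the PR's code; the type names below are plain strings standing in for the pyspark classes.

```python
import ctypes

def spark_int_type(ctype, signed):
    # Pick the smallest Spark integral type wide enough for this C type.
    # An unsigned type needs one extra bit, because all JVM integral types
    # are signed.
    bits = ctypes.sizeof(ctype) * 8 + (0 if signed else 1)
    for width, name in [(8, 'ByteType'), (16, 'ShortType'),
                        (32, 'IntegerType'), (64, 'LongType')]:
        if bits <= width:
            return name
    raise TypeError("not supported type: integer size too large.")

# On typical platforms a C int is 32 bits, so a signed int fits IntegerType
# but an unsigned int (33 bits of signed range) is widened to LongType.
print(spark_int_type(ctypes.c_int, signed=True))
print(spark_int_type(ctypes.c_uint, signed=False))
```

Note that a 64-bit unsigned long would need 65 bits and therefore cannot be widened at all, which is one argument for rejecting unsigned types outright.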
[GitHub] spark issue #14198: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/14198 I'd love to help
[GitHub] spark pull request #14231: [SPARK-16586] Change the way the exit code of lau...
Github user zasdfgbnm closed the pull request at: https://github.com/apache/spark/pull/14231
[GitHub] spark issue #14231: [SPARK-16586] Change the way the exit code of launcher i...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/14231 Yes this patch looks clearer
[GitHub] spark pull request #14231: [SPARK-16586] Change the way the exit code of lau...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72803075

--- Diff: bin/spark-class ---
@@ -65,24 +65,25 @@ fi
 # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
 # an array that will be used to exec the final command.
 #
-# The exit code of the launcher is appended to the output, so the parent shell removes it from the
-# command array and checks the value to see if the launcher succeeded.
-build_command() {
-  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
-  printf "%d\0" $?
-}
+# To keep both the output and the exit code of the launcher, the output is first converted to a hex
+# dump which prevents the bash from getting rid of the NULL character, and the exit code retrieved
+# from the bash array ${PIPESTATUS[@]}.
+#
+# Note that the seperator NULL character can not be replace with space or '\n' so that the command
+# won't fail if some path of the user contain special characher such as '\n' or space
+#
+# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586]
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
--- End diff --

I cannot find a neat solution to deal with the `\0`... Another option is to replace the `\0` with `\n` and store it in a variable. But that will be a problem if the user's Spark path contains a `\n`...
[GitHub] spark pull request #14231: [SPARK-16586] Change the way the exit code of lau...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72562015

--- Diff: bin/spark-class ---
@@ -65,24 +65,25 @@ fi
 # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
 # an array that will be used to exec the final command.
 #
-# The exit code of the launcher is appended to the output, so the parent shell removes it from the
-# command array and checks the value to see if the launcher succeeded.
-build_command() {
-  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
-  printf "%d\0" $?
-}
+# To keep both the output and the exit code of the launcher, the output is first converted to a hex
+# dump which prevents the bash from getting rid of the NULL character, and the exit code retrieved
+# from the bash array ${PIPESTATUS[@]}.
+#
+# Note that the seperator NULL character can not be replace with space or '\n' so that the command
+# won't fail if some path of the user contain special characher such as '\n' or space
+#
+# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586]
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
--- End diff --

This won't work. `build_command` is executed in a subshell. The `exit $?` will only terminate the subshell.
[GitHub] spark pull request #14231: [SPARK-16586] Change the way the exit code of lau...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72460695

--- Diff: bin/spark-class ---
@@ -65,24 +65,25 @@ fi
 # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
 # an array that will be used to exec the final command.
 #
-# The exit code of the launcher is appended to the output, so the parent shell removes it from the
-# command array and checks the value to see if the launcher succeeded.
-build_command() {
-  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
-  printf "%d\0" $?
-}
+# To keep both the output and the exit code of the launcher, the output is first converted to a hex
+# dump which prevents the bash from getting rid of the NULL character, and the exit code retrieved
+# from the bash array ${PIPESTATUS[@]}.
+#
+# Note that the seperator NULL character can not be replace with space or '\n' so that the command
+# won't fail if some path of the user contain special characher such as '\n' or space
+#
+# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586]
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
--- End diff --

Things can also be implemented like

```
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  LAUNCHER_EXIT_CODE=$?
  if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then
    printf "\0"
  fi
  printf "%d\0" $LAUNCHER_EXIT_CODE
}
```

which minimizes the change to the original code
[GitHub] spark pull request #14231: [SPARK-16586] Change the way the exit code of lau...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72458729

--- Diff: bin/spark-class ---
@@ -65,24 +65,25 @@ fi
 # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
 # an array that will be used to exec the final command.
 #
-# The exit code of the launcher is appended to the output, so the parent shell removes it from the
-# command array and checks the value to see if the launcher succeeded.
-build_command() {
-  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
-  printf "%d\0" $?
-}
+# To keep both the output and the exit code of the launcher, the output is first converted to a hex
+# dump which prevents the bash from getting rid of the NULL character, and the exit code retrieved
+# from the bash array ${PIPESTATUS[@]}.
+#
+# Note that the seperator NULL character can not be replace with space or '\n' so that the command
+# won't fail if some path of the user contain special characher such as '\n' or space
+#
+# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586]
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
--- End diff --

If you further add a line like `echo ${CMD[@]}` after `if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then`, it will output something like `There is an error1`, which reports the error but the trailing exit code `1` looks ugly.
[GitHub] spark pull request #14231: [SPARK-16586] Change the way the exit code of lau...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72457989

--- Diff: bin/spark-class ---
@@ -65,24 +65,25 @@ fi
 # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
 # an array that will be used to exec the final command.
 #
-# The exit code of the launcher is appended to the output, so the parent shell removes it from the
-# command array and checks the value to see if the launcher succeeded.
-build_command() {
-  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
-  printf "%d\0" $?
-}
+# To keep both the output and the exit code of the launcher, the output is first converted to a hex
+# dump which prevents the bash from getting rid of the NULL character, and the exit code retrieved
+# from the bash array ${PIPESTATUS[@]}.
+#
+# Note that the seperator NULL character can not be replace with space or '\n' so that the command
+# won't fail if some path of the user contain special characher such as '\n' or space
+#
+# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586]
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
--- End diff --

The original implementation actually fails here, at line 83: `if [ $LAUNCHER_EXIT_CODE != 0 ]; then`. For example, if the launcher fails with the output "There is an error" and exit status 1, then this line becomes `if [ There is an error1 != 0 ]; then`, which triggers `[: too many arguments`. The easiest modification is to add `""` around `$LAUNCHER_EXIT_CODE`, which makes the script fail silently without any output.
[GitHub] spark pull request #14231: [SPARK-16586] Change the way the exit code of lau...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72410689

--- Diff: bin/spark-class ---
@@ -65,24 +65,25 @@ fi
 # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
 # an array that will be used to exec the final command.
 #
-# The exit code of the launcher is appended to the output, so the parent shell removes it from the
-# command array and checks the value to see if the launcher succeeded.
-build_command() {
-  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
-  printf "%d\0" $?
-}
+# To keep both the output and the exit code of the launcher, the output is first converted to a hex
+# dump which prevents the bash from getting rid of the NULL character, and the exit code retrieved
+# from the bash array ${PIPESTATUS[@]}.
+#
+# Note that the seperator NULL character can not be replace with space or '\n' so that the command
+# won't fail if some path of the user contain special characher such as '\n' or space
+#
+# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586]
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
--- End diff --

Yes, this is what I'm doing in this PR:

```
run launcher, encode its output and get its exit status
if (exit status is fail)
    decode the launcher's output and print
else  # exit status is success
    process as before
```
[GitHub] spark pull request #14231: [SPARK-16586] Change the way the exit code of lau...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72364403

--- Diff: bin/spark-class ---
@@ -65,24 +65,25 @@ fi
 # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
 # an array that will be used to exec the final command.
 #
-# The exit code of the launcher is appended to the output, so the parent shell removes it from the
-# command array and checks the value to see if the launcher succeeded.
-build_command() {
-  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
-  printf "%d\0" $?
-}
+# To keep both the output and the exit code of the launcher, the output is first converted to a hex
+# dump which prevents the bash from getting rid of the NULL character, and the exit code retrieved
+# from the bash array ${PIPESTATUS[@]}.
+#
+# Note that the seperator NULL character can not be replace with space or '\n' so that the command
+# won't fail if some path of the user contain special characher such as '\n' or space
+#
+# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586]
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
--- End diff --

The launcher doesn't actually launch anything; instead, it just outputs the command that should be used to launch the desired class, separated by `\0`.
[GitHub] spark pull request #14231: [SPARK-16586] Change the way the exit code of lau...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72364159 --- Diff: bin/spark-class --- @@ -65,24 +65,25 @@ fi # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating # an array that will be used to exec the final command. # -# The exit code of the launcher is appended to the output, so the parent shell removes it from the -# command array and checks the value to see if the launcher succeeded. -build_command() { - "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" - printf "%d\0" $? -} +# To keep both the output and the exit code of the launcher, the output is first converted to a hex +# dump which prevents the bash from getting rid of the NULL character, and the exit code retrieved +# from the bash array ${PIPESTATUS[@]}. +# +# Note that the seperator NULL character can not be replace with space or '\n' so that the command +# won't fail if some path of the user contain special characher such as '\n' or space +# +# Also note that when the launcher fails, it might not output something ending with '\0' [SPARK-16586] +_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]}) --- End diff -- If the launcher fails, it is sufficient to terminate the script and exit with the nonzero `$?`. But if it success, the the output, which contains `\0`, should be used to start a new command. That's why when we execute `"$RUNNER" -Xmx128m ...`, we should try to store both the exit code and the output. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
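The hex-dump pipeline described above can be sketched standalone. This is a minimal illustration, not the actual spark-class code: it assumes bash and an available `xxd`, and uses `printf` with made-up argument values as a stand-in for the real launcher.

```shell
# Stand-in "launcher": prints NUL-separated arguments and exits 0.
# xxd -p hex-encodes the stream so the NUL bytes survive the variable
# assignment; ${PIPESTATUS[0]} is the exit status of the first command
# in the pipeline (the "launcher"), not of xxd or tr.
CMD_HEX=$(printf 'java\0-cp\0my cp\0Main\0' | xxd -p | tr -d '\n'; exit "${PIPESTATUS[0]}")
LAUNCHER_EXIT_CODE=$?

# Decode the hex dump back to bytes and split on NUL into a bash array.
CMD=()
while IFS= read -r -d '' ARG; do
  CMD+=("$ARG")
done < <(printf '%s' "$CMD_HEX" | xxd -p -r)
```

Because the NUL separators are preserved end to end, the argument `my cp` (which contains a space) comes back as a single array element.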
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72276770

--- Diff: bin/spark-class ---
```diff
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
```
--- End diff --

Since bash does not allow storing `\0` in a variable, we need some way around that. One way is to run the launcher in a subshell and pipe the subshell's output into the current shell with bash's built-in `read` (see lines 76-78 of the original code). In that case `$?` is the exit status of a command in this shell's while loop, not the launcher's exit code in the subshell; that is why the original implementation appends `$?` to the end of the output.

Another way to deal with `\0`, the one proposed in this PR, is to store a hex dump of the NUL-separated output in a variable and decode it back whenever the output with `\0` is needed. In this case `$?` is the exit status of `tr` rather than of the launcher.
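The first approach (the original `build_command`/`read` scheme) can be sketched as follows. This is a simplified illustration, not the real spark-class: `printf` with made-up values stands in for the launcher.

```shell
# The subshell appends the exit status of the "launcher" (here: the
# first printf) after the NUL-separated arguments, as the original
# spark-class build_command did.
build_command() {
  printf 'java\0-cp\0my cp\0Main\0'
  printf '%d\0' $?
}

# Read everything NUL-by-NUL into an array in the current shell.
CMD=()
while IFS= read -r -d '' ARG; do
  CMD+=("$ARG")
done < <(build_command)

# The last element is the exit code; pop it off before exec'ing.
COUNT=${#CMD[@]}
LAUNCHER_EXIT_CODE=${CMD[$((COUNT - 1))]}
CMD=("${CMD[@]:0:$((COUNT - 1))}")
```

Note that `$?` inside the while loop belongs to `read`, which is exactly why the subshell has to smuggle the launcher's exit status out through the data stream.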
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72167441

--- Diff: bin/spark-class ---
```diff
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
```
--- End diff --

If the output of the launcher didn't contain `\0`, the code would be easy:

```bash
output=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
LAUNCHER_EXIT_CODE=$?
```
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r72167136

--- Diff: bin/spark-class ---
```diff
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
```
--- End diff --

The biggest problem in achieving this: on success, the launcher exits 0 and its output is a list of arguments separated by `\0`. Bash does not allow storing `\0` in a variable, so we have to find an alternative way to store the output, which makes `$?` no longer the exit status of the launcher.
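The constraint mentioned above is easy to demonstrate in isolation: bash command substitution silently drops NUL bytes, so the argument separators vanish the moment the output lands in a variable (a two-line sketch, assuming bash):

```shell
# printf emits three bytes: 'a', NUL, 'b'. After command substitution
# the NUL is gone and the two remaining characters are fused together.
RAW=$(printf 'a\0b')
LEN=${#RAW}   # the stored value is "ab", length 2, not 3
```

This is why the PR hex-encodes the stream before the assignment and only decodes it back to raw bytes inside a pipeline.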
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r71063901

--- Diff: bin/spark-class ---
```diff
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
```
--- End diff --

By the way, the previous code assumed that the output of the launcher ends with a `\0`; otherwise it would crash due to the same problem. I read the code in `org.apache.spark.launcher`. It seems that the launcher's main rarely prints output and then exits non-zero, so the only way this becomes a problem is an uncaught exception or something going wrong in the JVM itself.
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/14231#discussion_r71063792

--- Diff: bin/spark-class ---
```diff
+_CMD=$("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"|xxd -p|tr -d '\n';exit ${PIPESTATUS[0]})
```
--- End diff --

Yes, you are right. I found this problem when trying to run Spark for testing on the login node of an HPC cluster, which limits the amount of memory applications can use. This is not a big problem, so it is OK to mark it as "won't fix", but the failure can be confusing to users who don't read shell scripts. This patch is a small change, and it actually reduces the total line count (though I agree this one line is a little tricky and harder to understand); that is why I added comments explaining what is happening.
GitHub user zasdfgbnm opened a pull request: https://github.com/apache/spark/pull/14231 [SPARK-16586] Change the way the exit code of launcher is handled to avoid problem when launcher fails

## What changes were proposed in this pull request?

In the spark-class shell script, the exit code of the launcher is appended to the end of the launcher's output. However, when the launcher fails, there might not be a `\0` at the end of that output, which makes the test `[ $LAUNCHER_EXIT_CODE != 0 ]` abort with the error `[: too many arguments`. This patch fixes the bug by changing the way the exit code of the launcher is passed.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zasdfgbnm/spark fix_spark-class

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14231.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14231

commit 626be035b656750cc10394a1ed8bbfa021e53aa9
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-16T09:00:22Z

    [SPARK-16586] change the way the exit code of the launcher is handled so that there won't be a problem when the launcher fails
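The `[: too many arguments` failure described in the PR can be reproduced in isolation. In this sketch a made-up error string stands in for whatever the failing launcher left in the variable:

```shell
# When the launcher dies early, the "exit code" slot can end up holding
# a whole error message. Unquoted, it word-splits, and '[' sees far more
# operands than the binary comparison it expects:
LAUNCHER_EXIT_CODE="Error: launcher failed to start"
ERR=$( { [ $LAUNCHER_EXIT_CODE != 0 ]; } 2>&1 )   # captures "[: too many arguments"

# Quoting keeps the value as a single operand, so the comparison works
# (a non-numeric string simply compares unequal to "0").
if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then
  QUOTED_CHECK=nonzero
fi
```

Quoting alone would only mask the underlying problem, though: the real fix is making sure the exit code slot never holds anything but a number.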
[GitHub] spark pull request #14198: Fix bugs about types that result an array of null...
GitHub user zasdfgbnm opened a pull request: https://github.com/apache/spark/pull/14198 Fix bugs about types that result in an array of nulls when creating dataframe using python

## What changes were proposed in this pull request?

Fix bugs about types that result in an array of nulls when creating a DataFrame using Python. Python's `array.array` has richer types than Python itself, e.g. we can have `array('f', [1, 2, 3])` and `array('d', [1, 2, 3])`. The code in spark-sql didn't take this into consideration, which can leave you with an array of null values when a row contains an `array('f')`. A simple way to reproduce this:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row, DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)
row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```

which outputs

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?

Tested manually.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zasdfgbnm/spark fix_array_infer

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14198.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14198

commit a127486d59528eae452dcbcc2ccfb68fdd7769b7
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-09T00:58:14Z

    use array.typecode to infer type

    Python's array has more types than Python itself; for example, Python only has float while array supports 'f' (float) and 'd' (double). Switching to array.typecode helps Spark make a better inference. For example, for the code:

        from pyspark.sql.types import _infer_type
        from array import array
        a = array('f', [1, 2, 3, 4, 5, 6])
        _infer_type(a)

    we got ArrayType(DoubleType,true) before the change, but ArrayType(FloatType,true) after the change.

commit 70131f3b81575edf9073d5be72553730d6316bd6
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-09T06:21:31Z

    Merge branch 'master' into fix_array_infer

commit 505e819f415c2f754b5147908516ace6f6ddfe78
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-13T12:53:18Z

    sync with upstream

commit 05979ca6eabf723cf3849ec2bf6f6e9de26cb138
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-14T08:07:12Z

    add case (c: Float, FloatType) to fromJava

commit 5cd817a4e7ec68a693ee2a878a2e36b09b1965b6
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-14T08:09:25Z

    sync with upstream
[GitHub] spark issue #14027: [SPARK-16352] Spark daemon run.sh
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/14027

OK, but will the status of the original JIRA stay "won't fix" if someone decides to reopen the discussion on it? Can anyone comment on whether the run-in-foreground feature is welcome in Spark upstream? Also, if there is already a good enough way to run in the foreground (which would make my PR useless), could anyone send me a link to the documentation describing it?
[GitHub] spark issue #3881: [SPARK-5964] Allow spark-daemon.sh to support foreground ...
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/3881

Can anyone explain what happened to this PR? Why was it closed without adding any support for running in the foreground?
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/14027

I followed that PR (https://github.com/apache/spark/pull/3881): "I was under the impression all the shell scripts were getting refactored and that this patch had become obsolete. I agree it's best to close this out." by @hellertime. I'm not sure what happened after that. It seems that the previous PR was closed because the scripts had changed a lot before it could be merged. I didn't see anything that supports running in the foreground, and nobody in that issue said that running in the foreground is useless and will never be supported.
Github user zasdfgbnm commented on the issue: https://github.com/apache/spark/pull/14027

I see a similar pull request here: https://github.com/apache/spark/pull/3881 But I didn't understand what happened to it or why it was closed without adding any support for running in the foreground.
[GitHub] spark pull request #14027: Spark daemon run.sh
GitHub user zasdfgbnm opened a pull request: https://github.com/apache/spark/pull/14027 Spark daemon run.sh

## What changes were proposed in this pull request?

I added some shell scripts to support running the master and slaves in the foreground, which makes it more convenient to write systemd services and to run on HPC clusters.

## How was this patch tested?

I tested it manually and it works well on my computer.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zasdfgbnm/spark spark-daemon-run.sh

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14027.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14027

commit 7b99b15656daa430081b1952bc090d3824d916ce
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-01T14:22:05Z

    split spark-daemon.sh to allow foreground

commit 700247e5cae5bfd33d24a54c133eef2b5a28df96
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-01T15:09:46Z

    fix bugs to make it work

commit 6a80b7986f831abaf5cdf23c42528055f4999b86
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-01T15:26:13Z

    fixes

commit 3095187162a821292eabedcd44c18932dbf73ca6
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-01T15:27:11Z

    starting->running

commit 1954640ccca1a1da7903455d67f1abdc2db0fd73
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-01T15:29:21Z

    add run-master.sh

commit b96f4e1c4a28ffc893779221100080c5d31a2508
Author: Xiang Gao <qasdfgtyu...@gmail.com>
Date: 2016-07-01T15:36:26Z

    add support run-slave.sh