[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18444 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128240401 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,93 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. Also, JVM only support signed types, when converting unsigned types, +# keep in mind that it required 1 more bit when stored as singed types. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html +# Reference for JVM's supported integral types: +# http://docs.oracle.com/javase/specs/jvms/se8/html/jvms-2.html#jvms-2.3.1 + +_array_signed_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'h': ctypes.c_short, +'i': ctypes.c_int, +'l': ctypes.c_long, +} + +_array_unsigned_int_typecode_ctype_mappings = { +'B': ctypes.c_ubyte, +'H': ctypes.c_ushort, +'I': ctypes.c_uint, +'L': ctypes.c_ulong +} + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. --- End diff -- nit: Scala type -> Catalyst datatype.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128174597 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,91 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. Also, JVM only support signed types, when converting unsigned types, +# keep in mind that it required 1 more bit when stored as singed types. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html +# Reference for JVM's supported integral types: +# http://docs.oracle.com/javase/specs/jvms/se8/html/jvms-2.html#jvms-2.3.1 + +_array_signed_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'h': ctypes.c_short, +'i': ctypes.c_int, +'l': ctypes.c_long, +} + +_array_unsigned_int_typecode_ctype_mappings = { +'B': ctypes.c_ubyte, +'H': ctypes.c_ushort, +'I': ctypes.c_uint, +'L': ctypes.c_ulong +} + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType + +# The list of all supported array typecodes is stored here +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not specified in C. +# On almost every system supported by both python and JVM, they are IEEE 754 +# single-precision binary floating-point format and IEEE 754 double-precision +# binary floating-point format. And we do assume the same thing here for now. 
+'f': FloatType, +'d': DoubleType +} + +# compute array typecode mappings for signed integer types +for _typecode in _array_signed_int_typecode_ctype_mappings.keys(): +size = ctypes.sizeof(_array_signed_int_typecode_ctype_mappings[_typecode]) * 8 +dt = _int_size_to_type(size) +if dt is not None: +_array_type_mappings[_typecode] = dt + +# compute array typecode mappings for unsigned integer types +for _typecode in _array_unsigned_int_typecode_ctype_mappings.keys(): +# JVM does not have unsigned types, so use signed types that is at list 1 --- End diff -- nit: typo `at list`.
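For reference, the size-based mapping quoted in the diff above can be sketched as a standalone snippet. This is a simplified re-creation, not the actual code in python/pyspark/sql/types.py: names are shortened and the Catalyst types are represented as plain strings.

```python
import ctypes

# Candidate C types behind Python array typecodes (simplified re-creation).
signed = {'b': ctypes.c_byte, 'h': ctypes.c_short,
          'i': ctypes.c_int, 'l': ctypes.c_long}
unsigned = {'B': ctypes.c_ubyte, 'H': ctypes.c_ushort,
            'I': ctypes.c_uint, 'L': ctypes.c_ulong}

def int_size_to_name(bits):
    # Smallest Catalyst integral type wide enough for `bits` bits,
    # or None when no JVM type can hold the value.
    for width, name in ((8, 'ByteType'), (16, 'ShortType'),
                        (32, 'IntegerType'), (64, 'LongType')):
        if bits <= width:
            return name
    return None

mappings = {}
for tc, ct in signed.items():
    mappings[tc] = int_size_to_name(ctypes.sizeof(ct) * 8)
for tc, ct in unsigned.items():
    # Unsigned values need one extra bit once stored in a signed JVM type.
    mappings[tc] = int_size_to_name(ctypes.sizeof(ct) * 8 + 1)

print(mappings['b'])  # ByteType: c_byte is always 8 bits
print(mappings['B'])  # ShortType: 8 unsigned bits need a 16-bit signed type
```

On an LP64 platform, `ctypes.sizeof(ctypes.c_ulong)` is 8, so `'L'` maps to `None`: no signed 64-bit JVM type can hold 65 bits, which is exactly the case debated below.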
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128142573 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Sounds good to me, too. Let's discuss it in the next pr.
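The array branch under review can be sketched outside Spark. The snippet below is a hypothetical stand-in for the real `_infer_type`: type names are plain strings, and `infer_array_type` is an illustrative helper, not a PySpark API.

```python
from array import array

# Simplified sketch of the array branch being reviewed: a known typecode
# maps to a concrete element type with containsNull=False (array.array
# cannot hold None), and an unknown typecode raises TypeError.
_array_type_mappings = {'d': 'DoubleType', 'i': 'IntegerType'}

def infer_array_type(obj):
    if obj.typecode in _array_type_mappings:
        return ('ArrayType', _array_type_mappings[obj.typecode], False)
    raise TypeError("not supported type: array(%s)" % obj.typecode)

print(infer_array_type(array('d', [1.0, 2.0])))
# -> ('ArrayType', 'DoubleType', False)
```

Note the contrast with the list branch, which keeps `containsNull=True` because a Python list can contain `None`.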
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128142511 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- https://issues.apache.org/jira/browse/SPARK-21465
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128141018 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- How about doing the following? 1. Make changes to the tests as you suggest, making sure we clearly document in comments why 'L' is an exception for Python 2. 2. Open a new JIRA explaining the issue about 'L' support. 3. Finish this PR.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128141384 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- OK. Let me do this. If possible, I would like to finish this PR before I travel.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128141295 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Sounds good to me.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128140334 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- > Should we add 'L' as an exception for python2 in the unsupported-types tests, or do we just completely remove the unsupported tests? I'd prefer keeping 'L' as an exception for Python 2. > Should we test 'L' for python 2? I don't think we should explicitly test this here. I guess we mistakenly happened to support this case, as you said. I think I (we contributors) can't make the decision (by rule) to remove this 'L' support ("undefined support status"), but I would like to push this PR forward, which is why I was thinking of deferring to a separate PR. cc @ueshin, are you online now? WDYT about removing the (unexpected, I guess) 'L' support in Python 2?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128138629 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- If we fall back to `_infer_type`, then some dirty changes to the test cases are needed to make them pass. Consider the following questions: 1. Should we add 'L' as an exception for python2 in the unsupported-types tests, or do we just completely remove the unsupported-types tests? 2. Should we test 'L' for python 2? I really like how the tests are organized now, and the changes above would make them very messy. My opinion is, we are not changing the status of 'L' from "supported" to "unsupported", but from "undefined support status" to "unsupported". If changing from "undefined support status" to "unsupported" sounds bad, then instead of making these changes to the test cases, I would rather solve this problem now and keep the test cases clean.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128135623 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- We could disallow or support this in a separate PR later. I would rather make a safe choice here right now.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128134574 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Wait... we could maybe go with decimal? Let's just fall back and leave this case as is. I am quite sure that would be safer.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128134072 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Okay. Either way is fine with me (falling back to type inference or disallowing unmappable types). Please let me know if any reviewer thinks differently.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128132778 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- @HyukjinKwon what do you think?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128132584 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Actually, I don't think 'L' should be supported, because there is no type in Scala that can hold a 64-bit unsigned integer...
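The point about 'L' can be checked numerically. This is a small illustration (not PySpark code) of why, on LP64 platforms where `array`'s 'L' typecode is backed by a 64-bit `unsigned long`, its values can exceed what the JVM's signed 64-bit `Long` can hold:

```python
import ctypes

# Width of the C unsigned long backing array typecode 'L' on this host:
# 64 bits on LP64 platforms (typical 64-bit Linux/macOS), 32 on LLP64.
bits = ctypes.sizeof(ctypes.c_ulong) * 8

unsigned_max = 2 ** bits - 1       # largest value an 'L' array can store
jvm_long_max = 2 ** 63 - 1         # java.lang.Long.MAX_VALUE

print(bits)
print(unsigned_max > jvm_long_max)  # True whenever bits >= 64
```

When `bits` is 64, an extra bit would be needed on the JVM side, and no signed integral type provides it; that is the "64-bits unsigned long" failure seen in the PickleException below.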
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128132120 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- @HyukjinKwon So, the conclusion is, we do support everything we supported before.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128131993 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Test result for python 2 is here:
```python
Python 2.7.13 (default, Jul 2 2017, 22:24:59)
[GCC 7.1.1 20170621] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/18 20:59:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/18 20:59:26 WARN Utils: Your hostname, archlinux resolves to a loopback address: 127.0.0.1; using 192.168.88.2 instead (on interface eno1)
17/07/18 20:59:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/07/18 20:59:29 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.13 (default, Jul 2 2017 22:24:59)
SparkSession available as 'spark'.
>>> import array,sys
>>> from pyspark import *
>>> from pyspark.sql import *
>>>
>>> def assertCollectSuccess(typecode, value):
...     row = Row(myarray=array.array(typecode, [value]))
...     df = spark.createDataFrame([row])
...     print(typecode)
...     df.show()
...
>>> assertCollectSuccess('u',u'a')
u
+---+
|myarray|
+---+
|[a]|
+---+
>>> assertCollectSuccess('f',1.2)
f
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('d',1.2)
d
+---+
|myarray|
+---+
| [1.2]|
+---+
>>> assertCollectSuccess('b',1)
b
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('B',1)
B
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('h',1)
h
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('H',1)
H
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('i',1)
i
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('I',1)
I
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('l',1)
l
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('L',1)
L
+---+
|myarray|
+---+
|[1]|
+---+
>>>
>>>
>>> if sys.version_info[0] > 2:
...     assertCollectSuccess('q',1)
...     assertCollectSuccess('Q',1)
...
>>>
>>> if sys.version_info[0] < 3:
...     assertCollectSuccess('c','a')
...
c
+---+
|myarray|
+---+
|[a]|
+---+
>>>
```
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128131947 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Definitely.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128131853 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Test result of python3 & spark 2.2.0 is here. I will check on python 2 in a few minutes. I paste the whole command line output, it's a bit lengthy. @HyukjinKwon could you please double check if my test is complete?
```python
>>> import array,sys
>>> from pyspark import *
>>> from pyspark.sql import *
>>>
>>> def assertCollectSuccess(typecode, value):
...     row = Row(myarray=array.array(typecode, [value]))
...     df = spark.createDataFrame([row])
...     print(typecode)
...     df.show()
...
>>> assertCollectSuccess('u',u'a')
u
+---+
|myarray|
+---+
|[a]|
+---+
>>> assertCollectSuccess('f',1.2)
f
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('d',1.2)
d
+---+
|myarray|
+---+
| [1.2]|
+---+
>>> assertCollectSuccess('b',1)
b
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('B',1)
B
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('h',1)
h
+---+
|myarray|
+---+
| [null]|
+---+
>>> assertCollectSuccess('H',1)
H
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('i',1)
i
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('I',1)
I
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('l',1)
l
+---+
|myarray|
+---+
|[1]|
+---+
>>> assertCollectSuccess('L',1)
L
17/07/18 20:55:55 ERROR Executor: Exception in task 14.0 in stage 119.0 (TID 799)
net.razorvine.pickle.PickleException: unsupported datatype: 64-bits unsigned long
	at net.razorvine.pickle.objects.ArrayConstructor.constructLongArrayFromUInt64(ArrayConstructor.java:302)
	at net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:240)
	at net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:26)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
	at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152)
	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at
```
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128130893 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -55,13 +55,12 @@ private[spark] object SerDeUtil extends Logging { //{'d', sizeof(double), d_getitem, d_setitem}, //{'\0', 0, 0, 0} /* Sentinel */ // }; -// TODO: support Py_UNICODE with 2 bytes val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, + Map('B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, 'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21 ) } else { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, + Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, --- End diff -- Yes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128130330 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -55,13 +55,12 @@ private[spark] object SerDeUtil extends Logging { //{'d', sizeof(double), d_getitem, d_setitem}, //{'\0', 0, 0, 0} /* Sentinel */ // }; -// TODO: support Py_UNICODE with 2 bytes val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, + Map('B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, 'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21 ) } else { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, + Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, --- End diff -- Ah, yea. I meant to say `'c' -> 1` was never reachable, for sure.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128129716 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Yes, this is worth a double check. Let me run some tests on an old Spark.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128129350 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- (@zasdfgbnm I edited comments here)
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128129280 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- If we fall back to `_infer_type` here, then `array('q')` will be inferred as something like an array of Long, and the user will then see an error from `net.razorvine.pickle` saying bad array typecode `q`, due to `SPARK-21420`, which seems to me more confusing than failing explicitly here.
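For context, the fail-fast behavior being argued for above can be sketched as a small standalone example. This is illustrative only, not the PR's actual code: `infer_array_element_type` and `_supported_typecodes` are made-up names, and the supported set is an assumption based on the typecodes discussed in this thread ('q'/'Q' deliberately excluded pending SPARK-21420).

```python
import array

# Assumed set of supported typecodes, for illustration; 'q' and 'Q' are
# excluded until the JVM-side pickler can handle them (SPARK-21420).
_supported_typecodes = {'b', 'h', 'i', 'l', 'B', 'H', 'I', 'L', 'f', 'd', 'u'}


def infer_array_element_type(arr):
    """Fail fast on unsupported typecodes instead of deferring the error
    to the JVM-side unpickler, where the message would be more confusing."""
    if arr.typecode in _supported_typecodes:
        return arr.typecode
    raise TypeError("not supported type: array(%s)" % arr.typecode)


print(infer_array_element_type(array.array('i', [1, 2])))  # prints "i"
try:
    infer_array_element_type(array.array('q', [1]))
except TypeError as e:
    print(e)  # prints "not supported type: array(q)"
```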
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128128844 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Okay. To double-check, does it cover all the cases we supported before in Spark? If that can be confirmed, I am fine with it as is. I was trying to leave a sign-off for this reason - fixing a case we never supported before and a bug (assigning the correct type for `array.array`), with all other cases working as they did.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128128561 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -55,13 +55,12 @@ private[spark] object SerDeUtil extends Logging { //{'d', sizeof(double), d_getitem, d_setitem}, //{'\0', 0, 0, 0} /* Sentinel */ // }; -// TODO: support Py_UNICODE with 2 bytes val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, + Map('B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, 'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21 ) } else { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, + Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, --- End diff -- `c` was supported by spark 2.2.0:

```python
>>> row = Row(myarray=array('c', ["a"]))
>>> df = spark.createDataFrame([row])
>>> df.first()["myarray"][0]
u'a'
```

This support, I think, was because arrays were inferred the same way as lists. But after changing the type inference for arrays, we have to change this accordingly to keep it working.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128127304 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- Is there any reason to do so? I don't think an array with an unsupported typecode will be correctly serialized or deserialized. In this case, it is better to raise a TypeError and let the user pick another type.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128124840 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -55,13 +55,12 @@ private[spark] object SerDeUtil extends Logging { //{'d', sizeof(double), d_getitem, d_setitem}, //{'\0', 0, 0, 0} /* Sentinel */ // }; -// TODO: support Py_UNICODE with 2 bytes val machineCodes: Map[Char, Int] = if (ByteOrder.nativeOrder().equals(ByteOrder.BIG_ENDIAN)) { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, + Map('B' -> 0, 'b' -> 1, 'H' -> 3, 'h' -> 5, 'I' -> 7, 'i' -> 9, 'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21 ) } else { - Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, + Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8, --- End diff -- Last question to double-check. So, 'c' never worked in Python 2, but this PR fixes it?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r128124801 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1016,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), False) +else: +raise TypeError("not supported type: array(%s)" % obj.typecode) --- End diff -- How about falling back to type inference in this case?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127839034 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- Also, keys() makes a copy of the key list in Python 2, whereas the `in` operator against a dict directly calls `__contains__`, if I remember correctly. Let's leave this out in any event.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127837941 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- Yea. It should be O(1) vs O(n) IIRC. Thanks for testing out.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127814477 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- I don't think `i in my_dict` and `i in my_dict.keys()` have much difference in performance. A simple test

```python
from random import random,shuffle
from time import time

N = 1000
d = {}
for i in range(N):
    d[i] = random()
tests = list(range(2*N))
shuffle(tests)

time1 = time()
dummy = 0
for i in tests:
    if i in d:
        dummy += 1
time2 = time()
dummy = 0
for i in tests:
    if i in d.keys():
        dummy += 1
time3 = time()
print(time2-time1)
print(time3-time2)
```

gives

```
11.389298915863037
12.40167999268
```

on my MacBook Pro
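A note for readers: the benchmark above was run on Python 3, where `dict.keys()` returns a lightweight view object, so both forms are O(1) membership tests; the O(1)-vs-O(n) distinction the reviewers mention applies to Python 2, where `keys()` built a list. A minimal sketch of the Python 3 behavior (numbers and names here are illustrative, not from the thread):

```python
import timeit

# A dict of 1000 keys; membership via `in d` and `in d.keys()`.
d = {i: None for i in range(1000)}

# Both lookups hash the key and call __contains__; keys() is a view,
# not a list copy, so neither scans all entries on Python 3.
t_dict = timeit.timeit('500 in d', globals={'d': d}, number=100_000)
t_keys = timeit.timeit('500 in d.keys()', globals={'d': d}, number=100_000)
print(t_dict, t_keys)  # similar, tiny timings on Python 3

# The results are identical either way, for hits and misses alike.
assert (999 in d) == (999 in d.keys())
assert (1000 in d) == (1000 in d.keys())
```

On Python 2 the same comparison would show `in d.keys()` degrading linearly with dict size, which is why dropping `.keys()` was the safer style.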
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584948 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- Oh, I take it back. This is possibly a hot path.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584880 --- Diff: python/pyspark/sql/tests.py --- @@ -30,6 +30,10 @@ import functools import time import datetime +import array +import math --- End diff -- BTW, this import looks unused.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584800 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -57,11 +57,11 @@ private[spark] object SerDeUtil extends Logging { // }; // TODO: support Py_UNICODE with 2 bytes --- End diff -- Yea.. actually, this is the reason why I don't like leaving a TODO. Yes, it appears to be fixed.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584731 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. Also, JVM only support signed types, when converting unsigned types, +# keep in mind that it required 1 more bit when stored as singed types. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html +# Reference for JVM's supported integral types: +# http://docs.oracle.com/javase/specs/jvms/se8/html/jvms-2.html#jvms-2.3.1 + +_array_signed_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'h': ctypes.c_short, +'i': ctypes.c_int, +'l': ctypes.c_long, +} + +_array_unsigned_int_typecode_ctype_mappings = { +'B': ctypes.c_ubyte, +'H': ctypes.c_ushort, +'I': ctypes.c_uint, +'L': ctypes.c_ulong +} + +# TODO: [SPARK-21420] +# Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version_info[0] >= 3: +# _array_signed_int_typecode_ctype_mappings['q'] = ctypes.c_longlong +# _array_unsigned_int_typecode_ctype_mappings['Q'] = ctypes.c_ulonglong --- End diff -- Personally, I don't like leaving a TODO in code. I am pretty sure we have unresolved (or not-yet-removed, although already fixed) TODOs. At least, I cleared a few TODOs before. I guess `SPARK-21420` has a good description. Let's remove this, as the commit log and discussion preserve the code changes and context.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584576 --- Diff: python/pyspark/sql/tests.py --- @@ -30,6 +30,10 @@ import functools import time import datetime +import array +import math +import ctypes + --- End diff -- little nit: could we remove this extra newline?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127584600 --- Diff: python/pyspark/sql/types.py --- @@ -938,12 +1023,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: --- End diff -- Could we explicitly do `_array_type_mappings.keys()` if you are fine with that? I find it confusing sometimes.
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127555013 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -57,11 +57,11 @@ private[spark] object SerDeUtil extends Logging { // }; // TODO: support Py_UNICODE with 2 bytes --- End diff -- Since `array('u')` passes the tests, should this TODO be removed?
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127521525 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version_info[0] >= 3: +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +# The list of all supported array typecodes is stored here +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not specified in C. +# On almost every system supported by both python and JVM, they are IEEE 754 +# single-precision binary floating-point format and IEEE 754 double-precision +# binary floating-point format. And we do assume the same thing here for now. 
+'f': FloatType, +'d': DoubleType } + +# compute array typecode mappings for integer types +for _typecode in _array_int_typecode_ctype_mappings.keys(): +size = ctypes.sizeof(_array_int_typecode_ctype_mappings[_typecode]) * 8 +if _typecode.isupper(): # 1 extra bit is required to store unsigned types --- End diff -- This is actually a good suggestion!
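The signed/unsigned sizing logic being reviewed above can be sketched as a standalone snippet. This is an approximation of the PR's `_int_size_to_type` loop, not the merged code: the Catalyst types are represented as strings here rather than real `pyspark.sql.types` classes, and the reported sizes come from the local C ABI (so `'l'`/`'L'` results vary by platform).

```python
import ctypes

def int_size_to_type_name(size_in_bits):
    # Smallest signed JVM/Catalyst integral type that can hold the value
    # losslessly; None if no 64-bit-or-smaller signed type fits (e.g. 65 bits).
    if size_in_bits <= 8:
        return 'ByteType'
    if size_in_bits <= 16:
        return 'ShortType'
    if size_in_bits <= 32:
        return 'IntegerType'
    if size_in_bits <= 64:
        return 'LongType'
    return None

signed = {'b': ctypes.c_byte, 'h': ctypes.c_short,
          'i': ctypes.c_int, 'l': ctypes.c_long}
unsigned = {'B': ctypes.c_ubyte, 'H': ctypes.c_ushort,
            'I': ctypes.c_uint, 'L': ctypes.c_ulong}

mappings = {}
for tc, ctype in signed.items():
    name = int_size_to_type_name(ctypes.sizeof(ctype) * 8)
    if name is not None:
        mappings[tc] = name
for tc, ctype in unsigned.items():
    # The JVM only has signed integral types, so an n-bit unsigned C type
    # needs n + 1 bits once stored as a signed value.
    name = int_size_to_type_name(ctypes.sizeof(ctype) * 8 + 1)
    if name is not None:
        mappings[tc] = name

print(mappings)
```

Note how this makes the "+1 bit" rule concrete: an 8-bit unsigned `'B'` needs 9 bits and so lands in `ShortType`, and a 64-bit unsigned `'L'` on an LP64 platform would need 65 bits and simply gets no mapping.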
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127513810 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle --- End diff -- I created a JIRA at https://issues.apache.org/jira/browse/SPARK-21420 and updated the comment here to reference it. I didn't remove this TODO because I think it might help with a future fix. Do you prefer to remove it?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127420412 --- Diff: python/pyspark/sql/tests.py --- @@ -2312,6 +2317,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): --- End diff -- Minor suggestions: ```diff @@ -2327,23 +2327,29 @@ class SQLTests(ReusedPySparkTestCase): # See: https://docs.python.org/2/library/array.html def assertCollectSuccess(typecode, value): -a = array.array(typecode, [value]) -row = Row(myarray=a) +row = Row(myarray=array.array(typecode, [value])) df = self.spark.createDataFrame([row]) -self.assertEqual(df.collect()[0]["myarray"][0], value) +self.assertEqual(df.first()["myarray"][0], value) -supported_types = [] - -# test string types +# supported string types +# +# blabla... "u" will be removed in python 4 blabla... +# and "c" not supported in python 3 blabla ... +supported_string_types = [] if sys.version_info[0] < 4: -supported_types += ['u'] +supported_string_types += ['u'] +# test unicode assertCollectSuccess('u', u"a") if sys.version_info[0] < 3: -supported_types += ['c'] +supported_string_types += ['c'] +# test string assertCollectSuccess('c', "a") +# supported float and double +# +# tests float max min blabla +supported_fractional_types = ['f', 'd'] # test float and double, assuming IEEE 754 floating-point format -supported_types += ['f', 'd'] assertCollectSuccess('f', ctypes.c_float(1e+38).value) assertCollectSuccess('f', ctypes.c_float(1e-38).value) assertCollectSuccess('f', ctypes.c_float(1.123456).value) @@ -2351,33 +2357,48 @@ class SQLTests(ReusedPySparkTestCase): assertCollectSuccess('d', sys.float_info.min) assertCollectSuccess('d', sys.float_info.epsilon) +# supported int types +# +# blabla .. only supported int types.. blabla.. 
+supported_int_types = list( +set(_array_int_typecode_ctype_mappings.keys()) +.intersection(set(_array_type_mappings.keys( # test int types -supported_int = list(set(_array_int_typecode_ctype_mappings.keys()). - intersection(set(_array_type_mappings.keys( -supported_types += supported_int -for i in supported_int: -ctype = _array_int_typecode_ctype_mappings[i] -if i.isupper(): # unsigned -assertCollectSuccess(i, 2 ** (ctypes.sizeof(ctype) * 8) - 1) -else: # signed +for t in supported_int_types: +ctype = _array_int_typecode_ctype_mappings[t] +if t.isupper(): +# test unsigned int types +assertCollectSuccess(t, 2 ** (ctypes.sizeof(ctype) * 8) - 1) +else: +# test signed int types max_val = 2 ** (ctypes.sizeof(ctype) * 8 - 1) -assertCollectSuccess(i, max_val - 1) -assertCollectSuccess(i, -max_val) - -# make sure that the test case cover all supported types +assertCollectSuccess(t, max_val - 1) +assertCollectSuccess(t, -max_val) + +# all supported types +# +# Make sure all the supported types are blabla ... +supported_types = (supported_string_types + + supported_fractional_types + + supported_int_types) +# test these are all supported types self.assertEqual(set(supported_types), set(_array_type_mappings.keys())) -# test unsupported types +# all unsupported types +# +# ... ... types are not supported in python 2/3 blabla. if sys.version_info[0] < 3: -all_type_codes = set(['c', 'b', 'B', 'u', 'h', 'H', 'i', 'I', 'l', 'L', 'f', 'd']) +all_types = set(['c', 'b', 'B', 'u', 'h', 'H', 'i', 'I', 'l', 'L', 'f', 'd']) else: -all_type_codes = set(array.typecodes) -unsupported_types = all_type_codes - set(supported_types) +all_types = set(array.typecodes) +unsupported_types = all_types -
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127421739 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- I don't think we should rely on exception handling here. Could we do something like the one below? ```python def _int_size_to_type(size): """ Return the Scala type from the size of integers. """ if size <= 8: return ByteType if size <= 16: return ShortType if size <= 32: return IntegerType if size <= 64: return LongType ... # compute array typecode mappings for integer types ... 
dt = _int_size_to_type(size) if dt is not None: _array_type_mappings[_typecode] = dt else: raise TypeError("not supported type: integer size too large.") ```
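The suggested control flow can be exercised standalone (a sketch, not the real pyspark module: type names are plain strings, the bit widths are hard-coded for illustration, and unmappable typecodes are skipped rather than raised, matching the "by design" behavior described later in the thread):

```python
def _int_size_to_type(size):
    # Return None (instead of raising) when no JVM integral type is wide
    # enough, so the caller decides how to handle an unmappable typecode.
    if size <= 8:
        return 'ByteType'
    if size <= 16:
        return 'ShortType'
    if size <= 32:
        return 'IntegerType'
    if size <= 64:
        return 'LongType'
    return None

array_type_mappings = {}
# bit widths: signed types need their native width, unsigned ones one extra bit
for typecode, nbits in [('b', 8), ('B', 9), ('l', 64), ('L', 65)]:
    dt = _int_size_to_type(nbits)
    if dt is not None:
        array_type_mappings[typecode] = dt
    # 'L' (needing 65 bits on LP64) is simply left out of the mapping
```

The advantage over the try/except version is that building the table at import time never relies on exception handling for an expected, non-error case.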
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r127421891 --- Diff: python/pyspark/sql/types.py --- @@ -915,6 +916,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version_info[0] >= 3: +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +# The list of all supported array typecodes is stored here +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not specified in C. +# On almost every system supported by both python and JVM, they are IEEE 754 +# single-precision binary floating-point format and IEEE 754 double-precision +# binary floating-point format. And we do assume the same thing here for now. 
+'f': FloatType, +'d': DoubleType } + +# compute array typecode mappings for integer types +for _typecode in _array_int_typecode_ctype_mappings.keys(): +size = ctypes.sizeof(_array_int_typecode_ctype_mappings[_typecode]) * 8 +if _typecode.isupper(): # 1 extra bit is required to store unsigned types --- End diff -- I would rather make a separate list for unsigned types than add an if here and in the test (not a strong opinion).
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126072461 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle --- End diff -- Could we open a JIRA with the suggested comments below and remove this todo and comments here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126072745 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,90 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, chapter 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version_info[0] >= 3: +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +# The list of all supported array typecodes is stored here +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not specified in C. +# On almost every system supported by both python and JVM, they are IEEE 754 +# single-precision binary floating-point format and IEEE 754 double-precision +# binary floating-point format. And we do assume the same thing here for now. 
+'f': FloatType, +'d': DoubleType } + +# compute array typecode mappings for integer types +for _typecode in _array_int_typecode_ctype_mappings.keys(): +size = ctypes.sizeof(_array_int_typecode_ctype_mappings[_typecode]) * 8 +if _typecode.isupper(): # 1 extra bit is required to store unsigned types +size += 1 +try: +_array_type_mappings.update({ +_typecode: _int_size_to_type(size) +}) --- End diff -- How about `_array_type_mappings[_typecode] = _int_size_to_type(size)`? (and for the other similar instances here)
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126809517 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -72,7 +72,11 @@ private[spark] object SerDeUtil extends Logging { val typecode = args(0).asInstanceOf[String].charAt(0) // This must be ISO 8859-1 / Latin 1, not UTF-8, to interoperate correctly val data = args(1).asInstanceOf[String].getBytes(StandardCharsets.ISO_8859_1) --- End diff -- Can anyone explain why `ISO_8859_1` is used here instead of UTF16 or UTF32?
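One plausible answer, sketched below: ISO-8859-1 maps all 256 byte values one-to-one onto code points, so arbitrary binary data survives a decode/encode round trip, which UTF-8 (and the multi-byte units of UTF-16/32) cannot guarantee. This only illustrates the property; it is not the actual pickle wire path.

```python
# Every possible byte value round-trips through Latin-1 unchanged.
raw = bytes(range(256))
assert raw.decode('iso-8859-1').encode('iso-8859-1') == raw

# UTF-8, by contrast, rejects many byte sequences outright.
try:
    b'\xff\xfe'.decode('utf-8')
    utf8_accepts = True
except UnicodeDecodeError:
    utf8_accepts = False
assert not utf8_accepts
```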
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r126806942 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -72,7 +72,11 @@ private[spark] object SerDeUtil extends Logging { val typecode = args(0).asInstanceOf[String].charAt(0) // This must be ISO 8859-1 / Latin 1, not UTF-8, to interoperate correctly val data = args(1).asInstanceOf[String].getBytes(StandardCharsets.ISO_8859_1) -construct(typecode, machineCodes(typecode), data) +val machine_code = machineCodes(typecode) +// fix data alignment +val unit_length = if (machine_code==18 || machine_code==19) 2 else 4 +val aligned_data = data ++ Array.fill[Byte](unit_length - data.length % unit_length)(0) --- End diff -- Not done yet. I think this will only work on little endian.
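The byte-order concern can be illustrated directly (a standalone sketch, not the Scala code path: it just shows that the same raw bytes decode to different integers depending on endianness, which is why blindly appending zero bytes is only safe under one byte order):

```python
import struct

payload = b'\x01\x00\x00\x00'  # four raw bytes, as they might arrive from the pickler

little = struct.unpack('<i', payload)[0]  # read as little-endian int32
big = struct.unpack('>i', payload)[0]     # read as big-endian int32

assert little == 1
assert big == 2 ** 24  # same bytes, different value: 16777216
```

Zero bytes appended at the end are trailing (insignificant) digits only in the little-endian reading; in the big-endian reading they shift the significant bytes, corrupting the value.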
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125910692 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +1043,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- Yes, you are right ```python >>> from array import * >>> for i in set(typecodes): ... try: ... array(i,[None]) ... print(i,'success') ... except: ... print(i,'fail') ... L fail I fail b fail B fail i fail H fail q fail Q fail f fail l fail u fail h fail d fail ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125855962 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +1043,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- Btw, can `array` have `None` value? If not, we can use `False` for `containsNull` of `ArrayType`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125642042 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173526 --- Diff: python/pyspark/sql/tests.py --- @@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html + +def assertCollectSuccess(typecode, value): +a = array.array(typecode, [value]) +row = Row(myarray=a) +df = self.spark.createDataFrame([row]) +self.assertEqual(df.collect()[0]["myarray"][0], value) + +supported_types = [] + +# test string types +if sys.version < "4": --- End diff -- Yes, this looks better I think. I will change that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173517 --- Diff: python/pyspark/sql/tests.py --- @@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html + +def assertCollectSuccess(typecode, value): +a = array.array(typecode, [value]) +row = Row(myarray=a) +df = self.spark.createDataFrame([row]) +self.assertEqual(df.collect()[0]["myarray"][0], value) + +supported_types = [] + +# test string types +if sys.version < "4": +supported_types += ['u'] +assertCollectSuccess('u', "a") +if sys.version < "3": +supported_types += ['c'] +assertCollectSuccess('c', "a") + +# test float and double, assuming IEEE 754 floating-point format +supported_types += ['f', 'd'] +assertCollectSuccess('f', ctypes.c_float(1e+38).value) +assertCollectSuccess('f', ctypes.c_float(1e-38).value) +assertCollectSuccess('f', ctypes.c_float(1.123456).value) +assertCollectSuccess('d', ctypes.c_double(1e+308).value) +assertCollectSuccess('d', ctypes.c_double(1e+308).value) +assertCollectSuccess('d', ctypes.c_double(1.123456789012345).value) + +# test int types +supported_int = list(set(_array_int_typecode_ctype_mappings.keys()). + intersection(set(_array_type_mappings.keys( +supported_types += supported_int +for i in supported_int: +ctype = _array_int_typecode_ctype_mappings[i] +if i.isupper(): --- End diff -- Yes, good idea. I will change that. 
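The signed/unsigned bounds used in the test above can be derived directly from the ctypes sizes (a sketch; `c_byte`/`c_ubyte` are one byte on every platform, while the wider types are platform dependent):

```python
import ctypes

def int_bounds(typecode, ctype):
    """Inclusive (min, max) range of the C type behind an array typecode."""
    bits = ctypes.sizeof(ctype) * 8
    if typecode.isupper():                        # unsigned typecode
        return 0, 2 ** bits - 1
    return -2 ** (bits - 1), 2 ** (bits - 1) - 1  # signed, two's complement

assert int_bounds('b', ctypes.c_byte) == (-128, 127)
assert int_bounds('B', ctypes.c_ubyte) == (0, 255)
```

Testing exactly these endpoints is what catches an off-by-one in the "unsigned needs one extra bit" logic.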
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173508 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- But I think I would add a simple line of comment on this to make the code more readable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173496 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- I don't think we should log this. This is just a helper function that helps to construct `_array_type_mappings`, which is a complete list of all supported type codes. Being filtered out here is not an error, it's by design. If users try to use unsupported typecode, they will see another `TypeError` due to line 1052: ```python . 
elif isinstance(obj, array): if obj.typecode in _array_type_mappings: return ArrayType(_array_type_mappings[obj.typecode](), True) else: raise TypeError("not supported type: array(%s)" % obj.typecode) . ```
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125173449 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not unspecified. --- End diff -- Yes, you are correct. Thanks for figuring this out. I just checked, `sys.float_info` is the info for C type double, not for C type float: ```python >>> import sys >>> sys.float_info.max 1.7976931348623157e+308 >>> sys.float_info.dig 15 ``` So we can not use this to check range for C float. But this is not the main reason I'm not using it. 
The main reason is: although C does not specify that IEEE-754 floating point types must be used, every C platform I have ever seen uses IEEE-754. (There is also a StackOverflow question about this: https://stackoverflow.com/questions/31967040/is-it-safe-to-assume-floating-point-is-represented-using-ieee754-floats-in-c ) I don't even know if there exists a platform in the world where Python is supported, the JVM is supported, and the C floating point types do not use IEEE-754. So I think it is OK to assume these types are IEEE-754 for now to keep the code cleaner; it is not worth the effort to support something that might not even exist. But I'm not an expert on this either, so if you know someone who might know more about it, please ping them to double check. On the other hand, if users on such a platform ever find that this assumption is wrong, they can report a new bug to fix it. But yes, my comment is confusing and I will try to make it clearer. Also, thank you for pointing out `sys.float_info`; although I don't think I need it here, it will be very useful in the test cases, and I will change part of them to use it to make the code more readable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
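An editorial sketch to make the point above concrete (this code is illustrative, not part of the PR's diff): `sys.float_info` describes the C `double` that backs Python's `float`, while a round-trip through `ctypes.c_float` reveals what a C `float` (the 'f' array typecode) can actually hold.

```python
import sys
import ctypes

# sys.float_info describes the C double type backing Python's float,
# so its max (~1.8e308) is far beyond what a C float can represent.
print(sys.float_info.max)

# Round-tripping through ctypes.c_float shows the C float's limits:
# values beyond roughly 3.4e38 overflow to infinity...
print(ctypes.c_float(1e39).value)

# ...while in-range values survive, possibly with reduced precision.
print(ctypes.c_float(1e38).value)
```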
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125172166 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") --- End diff -- So in this case we are silently filtering out integer types from supported conversions - do we expect this to happen? Should we log this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
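As an editorial aside on the dispatch being reviewed above, the idea can be sketched standalone (the names here are illustrative, not the PR's exact code): pick the smallest signed Catalyst-style integral type wide enough for the given bit count, remembering that an unsigned C type needs one extra bit when it is stored as a signed JVM type.

```python
import ctypes

def int_size_to_type(size):
    """Map a bit width to the smallest signed Catalyst-style integral type name."""
    if size <= 8:
        return "ByteType"
    if size <= 16:
        return "ShortType"
    if size <= 32:
        return "IntegerType"
    if size <= 64:
        return "LongType"
    raise TypeError("not supported type: integer size too large.")

# A signed 16-bit C short fits in ShortType directly...
print(int_size_to_type(ctypes.sizeof(ctypes.c_short) * 8))
# ...but an unsigned 16-bit value needs one extra bit once stored as a
# signed JVM type, so it is promoted to the next wider type.
print(int_size_to_type(ctypes.sizeof(ctypes.c_ushort) * 8 + 1))
```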
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125171058 --- Diff: python/pyspark/sql/tests.py --- @@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html + +def assertCollectSuccess(typecode, value): +a = array.array(typecode, [value]) +row = Row(myarray=a) +df = self.spark.createDataFrame([row]) +self.assertEqual(df.collect()[0]["myarray"][0], value) + +supported_types = [] + +# test string types +if sys.version < "4": +supported_types += ['u'] +assertCollectSuccess('u', "a") +if sys.version < "3": +supported_types += ['c'] +assertCollectSuccess('c', "a") + +# test float and double, assuming IEEE 754 floating-point format +supported_types += ['f', 'd'] +assertCollectSuccess('f', ctypes.c_float(1e+38).value) +assertCollectSuccess('f', ctypes.c_float(1e-38).value) +assertCollectSuccess('f', ctypes.c_float(1.123456).value) +assertCollectSuccess('d', ctypes.c_double(1e+308).value) +assertCollectSuccess('d', ctypes.c_double(1e+308).value) +assertCollectSuccess('d', ctypes.c_double(1.123456789012345).value) + +# test int types +supported_int = list(set(_array_int_typecode_ctype_mappings.keys()). 
+ intersection(set(_array_type_mappings.keys( +supported_types += supported_int +for i in supported_int: +ctype = _array_int_typecode_ctype_mappings[i] +if i.isupper(): --- End diff -- In the code that makes the mapping you have a comment about isupper being unsigned, for Scala SQL devs who maybe have to debug this in the future, I'd duplicate this comment here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
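As the reviewer suggests, the signed/unsigned convention is worth restating wherever it is relied on. A small illustrative sketch (not the PR's code): in the `array` module an uppercase typecode is the unsigned counterpart of its lowercase letter, and each pair shares an item size.

```python
import array

# Uppercase typecodes are unsigned, lowercase are signed; each pair has
# the same item size (e.g. 'h' signed short vs 'H' unsigned short).
for code in ['b', 'B', 'h', 'H', 'i', 'I', 'l', 'L']:
    kind = "unsigned" if code.isupper() else "signed"
    print(code, kind, array.array(code).itemsize)

# An unsigned array accepts values a signed array of the same size rejects:
print(array.array('H', [65535])[0])
```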
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125171044 --- Diff: python/pyspark/sql/types.py --- @@ -935,6 +936,86 @@ def _parse_datatype_json_value(json_value): long: LongType, }) +# Mapping Python array types to Spark SQL DataType +# We should be careful here. The size of these types in python depends on C +# implementation. We need to make sure that this conversion does not lose any +# precision. +# +# Reference for C integer size, see: +# ISO/IEC 9899:201x specification, § 5.2.4.2.1 Sizes of integer types . +# Reference for python array typecode, see: +# https://docs.python.org/2/library/array.html +# https://docs.python.org/3.6/library/array.html + +_array_int_typecode_ctype_mappings = { +'b': ctypes.c_byte, +'B': ctypes.c_ubyte, +'h': ctypes.c_short, +'H': ctypes.c_ushort, +'i': ctypes.c_int, +'I': ctypes.c_uint, +'l': ctypes.c_long, +'L': ctypes.c_ulong +} + +# TODO: Uncomment this when 'q' and 'Q' are supported by net.razorvine.pickle +# Type code 'q' and 'Q' are not available at python 2 +# if sys.version > "2": +# _array_int_typecode_ctype_mappings.update({ +# 'q': ctypes.c_longlong, +# 'Q': ctypes.c_ulonglong +# }) + + +def _int_size_to_type(size): +""" +Return the Scala type from the size of integers. +""" +if size <= 8: +return ByteType +if size <= 16: +return ShortType +if size <= 32: +return IntegerType +if size <= 64: +return LongType +raise TypeError("not supported type: integer size too large.") + +_array_type_mappings = { +# Warning: Actual properties for float and double in C is not unspecified. --- End diff -- So I think you mean to say "is not specified" On the other hand, is there a reason we couldn't use sys.float_info to determine the ranges on this? I have not done a lot of poking with float arrays in Python though so just a question. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r125170990 --- Diff: python/pyspark/sql/tests.py --- @@ -2250,6 +2256,67 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html + +def assertCollectSuccess(typecode, value): +a = array.array(typecode, [value]) +row = Row(myarray=a) +df = self.spark.createDataFrame([row]) +self.assertEqual(df.collect()[0]["myarray"][0], value) + +supported_types = [] + +# test string types +if sys.version < "4": --- End diff -- consider maybe using sys.version_info? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
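The suggestion above matters because version strings compare character by character, not numerically. A minimal sketch of why `sys.version_info` is the robust choice:

```python
import sys

# String comparison is lexicographic, so a hypothetical Python "10.0.0"
# would wrongly satisfy sys.version < "4" (because "1" < "4").
print("10.0.0" < "4")

# sys.version_info is a tuple of ints and compares numerically.
if sys.version_info < (3,):
    print("Python 2: the 'c' array typecode exists")
else:
    print("Python 3: the 'c' typecode was removed; 'u' remains")
```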
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124676549 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +968,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- I meant https://github.com/apache/spark/commit/67c75021c59d93cda9b5d70c0ef6d547fff92083. I think it is okay. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124572922 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +968,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- @HyukjinKwon Could you send me the link to "the Pandas type related PR we helped review before"? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124536696 --- Diff: python/pyspark/sql/tests.py --- @@ -2259,6 +2261,69 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +# This test need to make sure that the Scala type selected is at least +# as large as the python's types. This is necessary because python's +# array types depend on C implementation on the machine. Therefore there +# is no machine independent correspondence between python's array types +# and Scala types. +# See: https://docs.python.org/2/library/array.html --- End diff -- @ueshin The answer to your question is explained here, a couple lines of comments I just added. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user zasdfgbnm commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124531036 --- Diff: python/pyspark/sql/tests.py --- @@ -2259,6 +2261,60 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +int_types = set(['b', 'h', 'i', 'l']) +float_types = set(['f', 'd']) +unsupported_types = set(array.typecodes) - int_types - float_types + +def collected(a): +row = Row(myarray=a) +rdd = self.sc.parallelize([row]) +df = self.spark.createDataFrame(rdd) +return df.collect()[0]["myarray"][0] +# test whether pyspark can correctly handle int types +for t in int_types: +# test positive numbers +a = array.array(t, [1]) +while True: --- End diff -- Let me figure that out. This patch was more than a year ago and I forget what was happening. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124489677 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +968,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- Thanks for cc'ing me. If we say `float` being `double` is a bug, I think this is the same instance of the Pandas type related PRs we helped review before. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124464895 --- Diff: python/pyspark/sql/tests.py --- @@ -2259,6 +2261,60 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +# test for SPARK-16542 +def test_array_types(self): +int_types = set(['b', 'h', 'i', 'l']) +float_types = set(['f', 'd']) +unsupported_types = set(array.typecodes) - int_types - float_types + +def collected(a): +row = Row(myarray=a) +rdd = self.sc.parallelize([row]) +df = self.spark.createDataFrame(rdd) +return df.collect()[0]["myarray"][0] +# test whether pyspark can correctly handle int types +for t in int_types: +# test positive numbers +a = array.array(t, [1]) +while True: --- End diff -- What is this loop for? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
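For context on the loop in question: it repeatedly grows a value until the array type overflows, probing the platform-specific range at runtime. An alternative editorial sketch (not the PR's code) computes the range directly from the C type's size:

```python
import ctypes

def signed_range(ctype):
    """Representable range of a signed C integer type, from its byte size."""
    bits = ctypes.sizeof(ctype) * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def unsigned_max(ctype):
    """Largest value of an unsigned C integer type."""
    return 2 ** (ctypes.sizeof(ctype) * 8) - 1

# On typical platforms short is 16 bits, so this prints (-32768, 32767);
# the exact figures for 'i'/'l' depend on the C implementation.
print(signed_range(ctypes.c_short))
print(unsigned_max(ctypes.c_ushort))
```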
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18444#discussion_r124465362 --- Diff: python/pyspark/sql/types.py --- @@ -958,12 +968,17 @@ def _infer_type(obj): return MapType(_infer_type(key), _infer_type(value), True) else: return MapType(NullType(), NullType(), True) -elif isinstance(obj, (list, array)): +elif isinstance(obj, list): for v in obj: if v is not None: return ArrayType(_infer_type(obj[0]), True) else: return ArrayType(NullType(), True) +elif isinstance(obj, array): +if obj.typecode in _array_type_mappings: +return ArrayType(_array_type_mappings[obj.typecode](), True) --- End diff -- I'm not sure if we can change the element type. Doesn't this break backward-compatibility? cc @HyukjinKwon --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types ...
GitHub user zasdfgbnm opened a pull request: https://github.com/apache/spark/pull/18444 [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating DataFrame using python

## What changes were proposed in this pull request?

This is the reopen of https://github.com/apache/spark/pull/14198, with merge conflicts resolved. @ueshin Could you please take a look at my code?

This fixes bugs about types that result in an array of nulls when creating a DataFrame using Python. Python's `array.array` has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. The code in spark-sql and pyspark didn't take this into consideration, which can produce an array of null values when a row contains an `array('f')`. A simple way to reproduce this bug is:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row, DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```

which has output

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?
New test case added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zasdfgbnm/spark fix_array_infer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18444.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18444

commit a127486d59528eae452dcbcc2ccfb68fdd7769b7
Author: Xiang Gao
Date: 2016-07-09T00:58:14Z

    use array.typecode to infer type

    Python's array has more types than Python itself; for example, Python only has float while array supports 'f' (float) and 'd' (double). Switching to array.typecode helps Spark make a better inference. For example, for the code:

        from pyspark.sql.types import _infer_type
        from array import array
        a = array('f', [1, 2, 3, 4, 5, 6])
        _infer_type(a)

    we will get ArrayType(DoubleType,true) before the change, but ArrayType(FloatType,true) after the change.

commit 70131f3b81575edf9073d5be72553730d6316bd6
Author: Xiang Gao
Date: 2016-07-09T06:21:31Z

    Merge branch 'master' into fix_array_infer

commit 505e819f415c2f754b5147908516ace6f6ddfe78
Author: Xiang Gao
Date: 2016-07-13T12:53:18Z

    sync with upstream

commit 05979ca6eabf723cf3849ec2bf6f6e9de26cb138
Author: Xiang Gao
Date: 2016-07-14T08:07:12Z

    add case (c: Float, FloatType) to fromJava

commit 5cd817a4e7ec68a693ee2a878a2e36b09b1965b6
Author: Xiang Gao
Date: 2016-07-14T08:09:25Z

    sync with upstream

commit cd2ec6bc707fb6e7255b3a6a6822c3667866c63c
Author: Xiang Gao
Date: 2016-10-17T02:44:48Z

    add test for array in dataframe

commit 527d969067e447f8bff6004570c27130346cdf76
Author: Xiang Gao
Date: 2016-10-17T03:13:47Z

    merge with upstream/master

commit 82223c02082793b899c7eeca70f7bbfcea516c28
Author: Xiang Gao
Date: 2016-10-17T03:35:47Z

    set unsigned types and Py_UNICODE as unsupported

commit 0a967e280b3250bf7217e61905ad28f010c4ed40
Author: Xiang Gao
Date: 2016-10-17T17:46:35Z

    fix code style

commit 2059435b45ed1f6337a4f935adcd029084cfec91
Author: Xiang Gao
Date: 2016-10-18T00:11:05Z

    fix the same problem for byte and short

commit 58b120c4d207d9332e6dcde20109651ad8e17190
Author: Xiang Gao
Date: 2017-06-28T01:28:03Z

    sync with upstream