[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r228012874
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     conversion on returned data. The conversion is not guaranteed to be correct and results
     should be checked for accuracy by users.
     """
+
+    # The following table shows most of the Pandas data and SQL type conversions in Pandas
+    # UDFs that are not yet visible to the user. Some of the behaviors are buggy and might be
+    # changed in the near future. The table might eventually have to be documented externally.
+    # Please see the PR for SPARK-25798 for the code used to generate the table below.
+    #
+    # |SQL Type \ Pandas Value(Type)|None(object(NoneType))|True(bool)|1(int8)|1(int16)|1(int32)|1(int64)|1(uint8)|1(uint16)|1(uint32)|1(uint64)|1.0(float16)|1.0(float32)|1.0(float64)|1970-01-01 00:00:00(datetime64[ns])|1970-01-01 00:00:00-05:00(datetime64[ns, US/Eastern])|a(object(string))|1(object(Decimal))|[1 2 3](object(array[int32]))|1.0(float128)|(1+0j)(complex64)|(1+0j)(complex128)|A(category)|1 days 00:00:00(timedelta64[ns])|  # noqa
+    # |boolean|None|True|True|True|True|True|True|True|True|True|False|False|False|False|False|X|X|X|False|False|False|X|False|  # noqa
+    # |tinyint|None|1|1|1|1|1|X|X|X|X|1|1|1|X|X|X|X|X|X|X|X|0|X|  # noqa
+    # |smallint|None|1|1|1|1|1|1|X|X|X|1|1|1|X|X|X|X|X|X|X|X|X|X|  # noqa
+    # |int|None|1|1|1|1|1|1|1|X|X|1|1|1|X|X|X|X|X|X|X|X|X|X|  # noqa
+    # |bigint|None|1|1|1|1|1|1|1|1|X|1|1|1|0|18000000000000|X|X|X|X|X|X|X|X|  # noqa
+    # |float|None|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|X|X|  # noqa

[The quoted diff is truncated here in the archived message; the remaining rows of the table were not preserved.]
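As a sanity check on one cell of this table: the `bigint` entry for the tz-aware timestamp (18000000000000) is simply that timestamp's epoch value in nanoseconds, since 1970-01-01 00:00:00-05:00 falls 18000 seconds after the epoch. A minimal pandas-only sketch, not part of the original thread:

```python
import pandas as pd

# 1970-01-01 00:00:00 in US/Eastern is 05:00 UTC, i.e. 18000 seconds after
# the epoch; in nanoseconds that is 18000000000000, matching the bigint cell.
ts = pd.Timestamp("1970-01-01 00:00:00", tz="US/Eastern")
print(ts.value)  # 18000000000000 (nanoseconds since the epoch, UTC)
```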

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22795


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-24 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227875311
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):

[Quoted diff identical to the hunk shown in full in the first message above; the archived message is truncated before the reviewer's comment.]

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227634476
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):

[Quoted diff identical to the hunk shown in full in the first message above; the archived message is truncated before the reviewer's comment.]

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227593281
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):

[Quoted diff identical to the hunk shown in full in the first message above; the archived message is truncated before the reviewer's comment.]

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-23 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227592794
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):

[Quoted diff identical to the hunk shown in full in the first message above, ending at the boolean row of the table, which this comment refers to.]
--- End diff --

Sure, no prob. I just don't want us to forget to update this and then have it look confusing, as if these were the expected results.


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227199025
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):

[Quoted diff identical to the hunk shown in full in the first message above; the archived message is truncated before the reviewer's comment.]

[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227198050
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):

[Quoted diff identical to the hunk shown in full in the first message above, ending at the boolean row of the table, which this comment refers to.]
--- End diff --

I think it's fine .. many conversions here look buggy; for instance, `A(category)` with `tinyint` becomes `0`, and the string conversions look off too .. Let's just fix these, upgrade Arrow, and then update this chart later..
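For context, a minimal sketch (not from the thread) of the `A(category)`-to-`tinyint` oddity mentioned above, assuming a PySpark shell where `spark` is predefined and PyArrow is installed:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Declared to return tinyint but actually returning a categorical Series;
# per the table above, this currently yields 0 instead of raising an error.
cat_udf = pandas_udf(lambda s: pd.Series(["A"] * len(s)).astype("category"), "tinyint")
df = spark.range(2).repartition(1)
df.select(cat_udf(df.id)).show()
```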


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227093067
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):

[Quoted diff identical to the hunk shown in full in the first message above, ending at the boolean row of the table, which this comment refers to.]
--- End diff --

Could you add a note that the bool conversion is not accurate and is being addressed in ARROW-3428?
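A minimal sketch (not from the thread) that exercises the inaccurate bool conversion, assuming a PySpark shell where `spark` is predefined and PyArrow is installed:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Declared to return boolean but actually returning float64 values; per the
# table above, 1.0(float64) currently converts to False rather than True.
as_bool = pandas_udf(lambda s: pd.Series([1.0] * len(s)), "boolean")
df = spark.range(2).repartition(1)
df.select(as_bool(df.id)).show()  # expected true, observed false pre-ARROW-3428
```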


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227026198
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):

[Quoted diff identical to the opening of the hunk shown in full in the first message above, ending just before the table.]
--- End diff --

It's not supposed to be exposed and visible to users yet as I said in the 
comments. Let's keep it internal for now.


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/22795#discussion_r227021073
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, functionType=None):

[Quoted diff identical to the opening of the hunk shown in full in the first message above, ending just before the table.]
--- End diff --

Nice code for generating this table! Maybe the right place for this table could be 
https://github.com/apache/spark/blob/987f386588de7311b066cf0f62f0eed64d4aa7d7/docs/sql-pyspark-pandas-with-arrow.md#usage-notes


---




[GitHub] spark pull request #22795: [SPARK-25798][PYTHON] Internally document type co...

2018-10-22 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22795

[SPARK-25798][PYTHON] Internally document type conversion between Pandas data and SQL types in Pandas UDFs

## What changes were proposed in this pull request?

We are facing some problems with type conversions between Pandas data and SQL types in Pandas UDFs.
It is even difficult to identify the problems (see #20163 and #22610).

This PR internally documents the type conversion table. Some of the conversions look buggy and we should fix them.

The table can be generated via the code below:

```python
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf

columns = [
('none', 'object(NoneType)'),
('bool', 'bool'),
('int8', 'int8'),
('int16', 'int16'),
('int32', 'int32'),
('int64', 'int64'),
('uint8', 'uint8'),
('uint16', 'uint16'),
('uint32', 'uint32'),
('uint64', 'uint64'),
('float16', 'float16'),
('float32', 'float32'),
('float64', 'float64'),
('date', 'datetime64[ns]'),
('tz_aware_dates', 'datetime64[ns, US/Eastern]'),
('string', 'object(string)'),
('decimal', 'object(Decimal)'),
('array', 'object(array[int32])'),
('float128', 'float128'),
('complex64', 'complex64'),
('complex128', 'complex128'),
('category', 'category'),
('tdeltas', 'timedelta64[ns]'),
]

def create_dataframe():
import pandas as pd
import numpy as np
import decimal
pdf = pd.DataFrame({
'none': [None, None],
'bool': [True, False],
'int8': np.arange(1, 3).astype('int8'),
'int16': np.arange(1, 3).astype('int16'),
'int32': np.arange(1, 3).astype('int32'),
'int64': np.arange(1, 3).astype('int64'),
'uint8': np.arange(1, 3).astype('uint8'),
'uint16': np.arange(1, 3).astype('uint16'),
'uint32': np.arange(1, 3).astype('uint32'),
'uint64': np.arange(1, 3).astype('uint64'),
'float16': np.arange(1, 3).astype('float16'),
'float32': np.arange(1, 3).astype('float32'),
'float64': np.arange(1, 3).astype('float64'),
'float128': np.arange(1, 3).astype('float128'),
'complex64': np.arange(1, 3).astype('complex64'),
'complex128': np.arange(1, 3).astype('complex128'),
'string': list('ab'),
'array': pd.Series([np.array([1, 2, 3], dtype=np.int32), np.array([1, 2, 3], dtype=np.int32)]),
'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]),
'date': pd.date_range('19700101', periods=2).values,
'category': pd.Series(list("AB")).astype('category')})
pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]]
pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, tz='US/Eastern')
return pdf

types = [
BooleanType(),
ByteType(),
ShortType(),
IntegerType(),
LongType(),
FloatType(),
DoubleType(),
DateType(),
TimestampType(),
StringType(),
DecimalType(10, 0),
ArrayType(IntegerType()),
MapType(StringType(), IntegerType()),
StructType([StructField("_1", IntegerType())]),
BinaryType(),
]

df = spark.range(2).repartition(1)
results = []
count = 0
total = len(types) * len(columns)
values = []
spark.sparkContext.setLogLevel("FATAL")
for t in types:
result = []
for column, pandas_t in columns:
v = create_dataframe()[column][0]
values.append(v)
try:
row = df.select(pandas_udf(lambda _: create_dataframe()[column], t)(df.id)).first()
ret_str = repr(row[0])
except Exception:
ret_str = "X"
result.append(ret_str)
progress = "SQL Type: [%s]\n  Pandas Value(Type): [%s(%s)]\n  Result Python Value: [%s]" % (
    t.simpleString(), v, pandas_t, ret_str)
count += 1
print("%s/%s:\n  %s" % (count, total, progress))
results.append([t.simpleString()] + list(map(str, result)))


schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(lambda 
values_column: "%s(%s)" % (values_column[0], values_column[1][1]), zip(values, 
columns)))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 
20, False)
print("\n".join(map(lambda line: "# %s  # noqa" % line, 
strings.strip().split("\n"

```
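(The snippet above assumes an interactive PySpark shell, e.g. `./bin/pyspark`, where `spark` is predefined and PyArrow is installed; run standalone, it would first need a `SparkSession`.)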

This code is compatible with both Python 2 and 3, but the table was generated under Python 2.

## How was this patch tested?

Manually tested, plus a lint check.
