[jira] [Assigned] (SPARK-38098) Add support for ArrayType of nested StructType to arrow-based conversion

2022-09-22 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-38098:


Assignee: Luca Canali

> Add support for ArrayType of nested StructType to arrow-based conversion
> 
>
> Key: SPARK-38098
> URL: https://issues.apache.org/jira/browse/SPARK-38098
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>
> This proposes to add support for ArrayType of nested StructType to 
> arrow-based conversion.
> This allows Pandas UDFs, mapInArrow UDFs, and toPandas to operate on columns 
> of type Array of Struct, via arrow serialization.
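For illustration, a minimal sketch of what this enables (assuming a Spark build
that includes this change, with Arrow enabled; the column and field names are
made up):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame(
    [(1, [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])],
    "id long, items array<struct<a:long,b:string>>")

# toPandas() now carries the array-of-struct column through Arrow serialization.
pdf = df.toPandas()

# A Series-to-Series pandas UDF can likewise receive the nested column.
@pandas_udf("int")
def count_items(items: pd.Series) -> pd.Series:
    return items.apply(len)

df.select(count_items("items")).show()
{code}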



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38098) Add support for ArrayType of nested StructType to arrow-based conversion

2022-09-22 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-38098.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35391
[https://github.com/apache/spark/pull/35391]

> Add support for ArrayType of nested StructType to arrow-based conversion
> 
>
> Key: SPARK-38098
> URL: https://issues.apache.org/jira/browse/SPARK-38098
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.4.0
>
>
> This proposes to add support for ArrayType of nested StructType to 
> arrow-based conversion.
> This allows Pandas UDFs, mapInArrow UDFs, and toPandas to operate on columns 
> of type Array of Struct, via arrow serialization.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39160) Remove workaround for ARROW-1948

2022-05-12 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-39160.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36518
[https://github.com/apache/spark/pull/36518]

> Remove workaround for ARROW-1948
> 
>
> Key: SPARK-39160
> URL: https://issues.apache.org/jira/browse/SPARK-39160
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39160) Remove workaround for ARROW-1948

2022-05-12 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-39160:


Assignee: Cheng Pan

> Remove workaround for ARROW-1948
> 
>
> Key: SPARK-39160
> URL: https://issues.apache.org/jira/browse/SPARK-39160
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type

2021-12-16 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-34521:


Assignee: Nicolas Azrak

> spark.createDataFrame does not support Pandas StringDtype extension type
> 
>
> Key: SPARK-34521
> URL: https://issues.apache.org/jira/browse/SPARK-34521
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Pavel Ganelin
>Assignee: Nicolas Azrak
>Priority: Major
> Fix For: 3.3.0
>
>
> The following test case demonstrates the problem:
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession, types
> spark = SparkSession.builder.appName(__file__) \
>     .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
>     .getOrCreate()
> good = pd.DataFrame([["abc"]], columns=["col"])
> schema = types.StructType([types.StructField("col", types.StringType(), True)])
> df = spark.createDataFrame(good, schema=schema)
> df.show()
> bad = good.copy()
> bad["col"] = bad["col"].astype("string")
> schema = types.StructType([types.StructField("col", types.StringType(), True)])
> df = spark.createDataFrame(bad, schema=schema)
> df.show()
> {code}
> The error:
> {code:java}
> C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: 
> UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Cannot specify a mask or a size when passing an object that is converted 
> with the __arrow_array__ protocol.
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> {code}
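A possible workaround on affected versions (an assumption on my part, not taken
from the report): cast the extension-typed column back to plain object dtype
before handing it to Spark, continuing the snippet above.

{code:python}
# Hypothetical workaround: drop the pandas "string" extension dtype so the Arrow
# conversion no longer goes through the __arrow_array__ protocol path.
bad["col"] = bad["col"].astype(object)
df = spark.createDataFrame(bad, schema=schema)
df.show()
{code}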



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type

2021-12-15 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-34521.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34509
[https://github.com/apache/spark/pull/34509]

> spark.createDataFrame does not support Pandas StringDtype extension type
> 
>
> Key: SPARK-34521
> URL: https://issues.apache.org/jira/browse/SPARK-34521
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Pavel Ganelin
>Priority: Major
> Fix For: 3.3.0
>
>
> The following test case demonstrates the problem:
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession, types
> spark = SparkSession.builder.appName(__file__) \
>     .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
>     .getOrCreate()
> good = pd.DataFrame([["abc"]], columns=["col"])
> schema = types.StructType([types.StructField("col", types.StringType(), True)])
> df = spark.createDataFrame(good, schema=schema)
> df.show()
> bad = good.copy()
> bad["col"] = bad["col"].astype("string")
> schema = types.StructType([types.StructField("col", types.StringType(), True)])
> df = spark.createDataFrame(bad, schema=schema)
> df.show()
> {code}
> The error:
> {code:java}
> C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: 
> UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Cannot specify a mask or a size when passing an object that is converted 
> with the __arrow_array__ protocol.
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2021-10-28 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Attachment: (was: 0--1172099527-254246775-1412485878)

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.1.0
>
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: -Struct-, -Array-, -Map-
>  * -*Decimal*-
>  * -*Binary*-
>  * -*Categorical*- when converting from Pandas
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34463) toPandas failed with error: buffer source array is read-only when Arrow with self-destruct is enabled

2021-03-02 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293893#comment-17293893
 ] 

Bryan Cutler edited comment on SPARK-34463 at 3/2/21, 6:11 PM:
---

As David said, it depends on what is done in Pandas that might lead to this. 
I'm not sure why `value_counts()` would cause this error, but other operations 
should work. I believe you could also work around it by making a copy of the 
DataFrame yourself. I think this example shows that the self-destruct feature 
should be clearly documented as experimental and only used if you know 
absolutely what you are doing.
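A minimal sketch of that copy workaround, reusing the reproduction from the
issue below (the intermediate variable names are added for illustration; I
believe a deep copy gives pandas writable buffers again, but that is an
assumption worth verifying):

{code:python}
# Deep-copy the pandas DataFrame produced by toPandas() so later pandas
# operations no longer touch the read-only, Arrow-backed buffers.
sdf = spark.createDataFrame(sc.parallelize([(i,) for i in range(13)], 1), 'id long') \
    .selectExpr('IF(id % 3==0, id+1, NULL) AS f1', '(id+1) % 2 AS label')
pdf = sdf.toPandas().copy(deep=True)
pdf['label'].value_counts()
{code}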


was (Author: bryanc):
As David said, it depends on what is done in Pandas that might lead to this. 
I'm not sure why `value_counts()` would cause this error, but other operations 
should work. I believe you could also workaround by making a copy of the 
DataFrame yourself. I think this example shows that the self descruct feature 
should be clearly documented to be experimental and only used if you know 
absolutely what you are doing.

> toPandas failed with error: buffer source array is read-only when Arrow with 
> self-destruct is enabled
> -
>
> Key: SPARK-34463
> URL: https://issues.apache.org/jira/browse/SPARK-34463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Weichen Xu
>Priority: Major
>
> Environment:
> apache/spark master 
>  pandas version > 1.0.5
> Reproduction code:
> {code:python}
> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
> spark.conf.set('spark.sql.execution.arrow.pyspark.selfDestruct.enabled', True)
> spark.createDataFrame(sc.parallelize([(i,) for i in range(13)], 1), 'id long') \
>     .selectExpr('IF(id % 3==0, id+1, NULL) AS f1', '(id+1) % 2 AS label') \
>     .toPandas()['label'].value_counts()
> {code}
> The error looks like:
> {quote}Traceback (most recent call last): 
>  File "", line 1, in 
>  File 
> "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/base.py",
>  line 1033, in value_counts
>  dropna=dropna,
>  File 
> "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/algorithms.py",
>  line 820, in value_counts
>  keys, counts = value_counts_arraylike(values, dropna)
>  File 
> "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/algorithms.py",
>  line 865, in value_counts_arraylike
>  keys, counts = f(values, dropna)
>  File "pandas/_libs/hashtable_func_helper.pxi", line 1098, in 
> pandas._libs.hashtable.value_count_int64
>  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
>  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
>  ValueError: buffer source array is read-only
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34463) toPandas failed with error: buffer source array is read-only when Arrow with self-destruct is enabled

2021-03-02 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293893#comment-17293893
 ] 

Bryan Cutler commented on SPARK-34463:
--

As David said, it depends on what is done in Pandas that might lead to this. 
I'm not sure why `value_counts()` would cause this error, but other operations 
should work. I believe you could also work around it by making a copy of the 
DataFrame yourself. I think this example shows that the self-destruct feature 
should be clearly documented as experimental and only used if you know 
absolutely what you are doing.

> toPandas failed with error: buffer source array is read-only when Arrow with 
> self-destruct is enabled
> -
>
> Key: SPARK-34463
> URL: https://issues.apache.org/jira/browse/SPARK-34463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Weichen Xu
>Priority: Major
>
> Environment:
> apache/spark master 
>  pandas version > 1.0.5
> Reproduction code:
> {code:python}
> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
> spark.conf.set('spark.sql.execution.arrow.pyspark.selfDestruct.enabled', True)
> spark.createDataFrame(sc.parallelize([(i,) for i in range(13)], 1), 'id long') \
>     .selectExpr('IF(id % 3==0, id+1, NULL) AS f1', '(id+1) % 2 AS label') \
>     .toPandas()['label'].value_counts()
> {code}
> The error looks like:
> {quote}Traceback (most recent call last): 
>  File "", line 1, in 
>  File 
> "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/base.py",
>  line 1033, in value_counts
>  dropna=dropna,
>  File 
> "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/algorithms.py",
>  line 820, in value_counts
>  keys, counts = value_counts_arraylike(values, dropna)
>  File 
> "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/algorithms.py",
>  line 865, in value_counts_arraylike
>  keys, counts = f(values, dropna)
>  File "pandas/_libs/hashtable_func_helper.pxi", line 1098, in 
> pandas._libs.hashtable.value_count_int64
>  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
>  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
>  ValueError: buffer source array is read-only
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32953) Lower memory usage in toPandas with Arrow self_destruct

2021-02-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-32953:


Assignee: David Li

> Lower memory usage in toPandas with Arrow self_destruct
> ---
>
> Key: SPARK-32953
> URL: https://issues.apache.org/jira/browse/SPARK-32953
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: David Li
>Assignee: David Li
>Priority: Major
> Fix For: 3.2.0
>
>
> As described on the mailing list:
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html]
> toPandas() can as much as double memory usage, as both Arrow and Pandas retain 
> a copy of the dataframe in memory during the conversion. Arrow >= 0.16 offers a 
> self_destruct mode that avoids this, with some caveats.
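A minimal sketch of enabling the option (the config names are taken from the
SPARK-34463 reproduction elsewhere in this thread; `df` stands for any Spark
DataFrame):

{code:python}
# Both Arrow and the self-destruct option must be enabled.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")

pdf = df.toPandas()  # Arrow buffers can be released as each column is converted
{code}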



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32953) Lower memory usage in toPandas with Arrow self_destruct

2021-02-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-32953.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 29818
[https://github.com/apache/spark/pull/29818]

> Lower memory usage in toPandas with Arrow self_destruct
> ---
>
> Key: SPARK-32953
> URL: https://issues.apache.org/jira/browse/SPARK-32953
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: David Li
>Priority: Major
> Fix For: 3.2.0
>
>
> As described on the mailing list:
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html]
> toPandas() can as much as double memory usage, as both Arrow and Pandas retain 
> a copy of the dataframe in memory during the conversion. Arrow >= 0.16 offers a 
> self_destruct mode that avoids this, with some caveats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2020-12-28 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255712#comment-17255712
 ] 

Bryan Cutler commented on SPARK-24632:
--

Ping [~huaxingao] in case you have some time to look into this.

> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a proposal 
> in the doc linked below.  Summary of proposal:
> Require that 3rd-party libraries whose Java classes have Python wrappers 
> implement a trait which provides the corresponding Python classpath in some 
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.  
> They would ideally test a Java class + Python wrapper class pair sitting 
> outside of pyspark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.

2020-12-11 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-33576.
--
Resolution: Duplicate

Going to resolve as a duplicate, but please reopen if you find it is different

> PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC 
> message: negative bodyLength'.
> -
>
> Key: SPARK-33576
> URL: https://issues.apache.org/jira/browse/SPARK-33576
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
> Environment: Databricks runtime 7.3
> Spark 3.0.1
> Scala 2.12
>Reporter: Darshat
>Priority: Major
>
> Hello,
> We are using Databricks on Azure to process a large amount of ecommerce data. 
> Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12.
> During processing, there is a groupby operation on the DataFrame that 
> consistently gets an exception of this type:
>  
> {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: 
> Invalid IPC message: negative bodyLength'. Full traceback below: Traceback 
> (most recent call last): File "/databricks/spark/python/pyspark/worker.py", 
> line 654, in main process() File 
> "/databricks/spark/python/pyspark/worker.py", line 646, in process 
> serializer.dump_stream(out_iter, outfile) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in 
> dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in 
> dump_stream for batch in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in 
> init_stream_yield_batches for series in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in 
> load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in 
> __iter__ File "pyarrow/ipc.pxi", line 432, in 
> pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", 
> line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative 
> bodyLength{color}
>  
> Code that causes this:
> {color:#ff}x = df.groupby('providerid').apply(domain_features){color}
> {color:#ff}display(x.info()){color}
> Dataframe size - 22 million rows, 31 columns
>  One of the columns is a string ('providerid') on which we do a groupby 
> followed by an apply  operation. There are 3 distinct provider ids in this 
> set. While trying to enumerate/count the results, we get this exception.
> We've put all possible checks in the code for null values, or corrupt data 
> and we are not able to track this to application level code. I hope we can 
> get some help troubleshooting this as this is a blocker for rolling out at 
> scale.
> The cluster has 8 nodes + driver, all 28GB RAM. I can provide any other 
> settings that could be useful. 
>  Hope to get some insights into the problem. 
> Thanks,
> Darshat Shah



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.

2020-12-11 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248064#comment-17248064
 ] 

Bryan Cutler commented on SPARK-33576:
--

[~darshats] I believe the only current workaround is to further split your 
groups with other keys to get under the 2GB limit. To take advantage of the new 
Arrow improvements for this would most likely require some work on the Spark 
side, but I'd have to look into it more.
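A hedged sketch of that splitting workaround, reusing the column and function
names from the report (the number of salt buckets is illustrative, and
domain_features would need to tolerate a provider being split across buckets,
e.g. by merging the per-bucket results afterwards):

{code:python}
from pyspark.sql import functions as F

n_buckets = 16  # illustrative; use enough buckets to keep each group under the 2GB Arrow limit
salted = df.withColumn("salt", (F.rand(seed=42) * n_buckets).cast("int"))

# Group on (providerid, salt) instead of providerid alone.
x = salted.groupby("providerid", "salt").apply(domain_features)
{code}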

> PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC 
> message: negative bodyLength'.
> -
>
> Key: SPARK-33576
> URL: https://issues.apache.org/jira/browse/SPARK-33576
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
> Environment: Databricks runtime 7.3
> Spark 3.0.1
> Scala 2.12
>Reporter: Darshat
>Priority: Major
>
> Hello,
> We are using Databricks on Azure to process a large amount of ecommerce data. 
> Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12.
> During processing, there is a groupby operation on the DataFrame that 
> consistently gets an exception of this type:
>  
> {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: 
> Invalid IPC message: negative bodyLength'. Full traceback below: Traceback 
> (most recent call last): File "/databricks/spark/python/pyspark/worker.py", 
> line 654, in main process() File 
> "/databricks/spark/python/pyspark/worker.py", line 646, in process 
> serializer.dump_stream(out_iter, outfile) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in 
> dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in 
> dump_stream for batch in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in 
> init_stream_yield_batches for series in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in 
> load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in 
> __iter__ File "pyarrow/ipc.pxi", line 432, in 
> pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", 
> line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative 
> bodyLength{color}
>  
> Code that causes this:
> {color:#ff}x = df.groupby('providerid').apply(domain_features){color}
> {color:#ff}display(x.info()){color}
> Dataframe size - 22 million rows, 31 columns
>  One of the columns is a string ('providerid') on which we do a groupby 
> followed by an apply  operation. There are 3 distinct provider ids in this 
> set. While trying to enumerate/count the results, we get this exception.
> We've put all possible checks in the code for null values, or corrupt data 
> and we are not able to track this to application level code. I hope we can 
> get some help troubleshooting this as this is a blocker for rolling out at 
> scale.
> The cluster has 8 nodes + driver, all 28GB RAM. I can provide any other 
> settings that could be useful. 
>  Hope to get some insights into the problem. 
> Thanks,
> Darshat Shah



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33576) PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'.

2020-12-01 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241769#comment-17241769
 ] 

Bryan Cutler commented on SPARK-33576:
--

Is this due to the 2GB limit? As in 
https://issues.apache.org/jira/browse/SPARK-32294 and 
https://issues.apache.org/jira/browse/ARROW-4890

> PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC 
> message: negative bodyLength'.
> -
>
> Key: SPARK-33576
> URL: https://issues.apache.org/jira/browse/SPARK-33576
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
> Environment: Databricks runtime 7.3
> Spark 3.0.1
> Scala 2.12
>Reporter: Darshat
>Priority: Major
>
> Hello,
> We are using Databricks on Azure to process a large amount of ecommerce data. 
> Databricks runtime is 7.3, which includes Apache Spark 3.0.1 and Scala 2.12.
> During processing, there is a groupby operation on the DataFrame that 
> consistently gets an exception of this type:
>  
> {color:#ff}PythonException: An exception was thrown from a UDF: 'OSError: 
> Invalid IPC message: negative bodyLength'. Full traceback below: Traceback 
> (most recent call last): File "/databricks/spark/python/pyspark/worker.py", 
> line 654, in main process() File 
> "/databricks/spark/python/pyspark/worker.py", line 646, in process 
> serializer.dump_stream(out_iter, outfile) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 281, in 
> dump_stream timely_flush_timeout_ms=self.timely_flush_timeout_ms) File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 97, in 
> dump_stream for batch in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 271, in 
> init_stream_yield_batches for series in iterator: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 287, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 228, in 
> load_stream for batch in batches: File 
> "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 118, in 
> load_stream for batch in reader: File "pyarrow/ipc.pxi", line 412, in 
> __iter__ File "pyarrow/ipc.pxi", line 432, in 
> pyarrow.lib._CRecordBatchReader.read_next_batch File "pyarrow/error.pxi", 
> line 99, in pyarrow.lib.check_status OSError: Invalid IPC message: negative 
> bodyLength{color}
>  
> Code that causes this:
> {color:#ff}x = df.groupby('providerid').apply(domain_features){color}
> {color:#ff}display(x.info()){color}
> Dataframe size - 22 million rows, 31 columns
>  One of the columns is a string ('providerid') on which we do a groupby 
> followed by an apply  operation. There are 3 distinct provider ids in this 
> set. While trying to enumerate/count the results, we get this exception.
> We've put all possible checks in the code for null values, or corrupt data 
> and we are not able to track this to application level code. I hope we can 
> get some help troubleshooting this as this is a blocker for rolling out at 
> scale.
> The cluster has 8 nodes + driver, all 28GB RAM. I can provide any other 
> settings that could be useful. 
>  Hope to get some insights into the problem. 
> Thanks,
> Darshat Shah



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type

2020-11-30 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241092#comment-17241092
 ] 

Bryan Cutler commented on SPARK-33489:
--

Great, thanks [~cactice]! Please feel free to ping me if you need help 
implementing or reviewing.

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Priority: Minor
>
> I got below error when using from_arrow_type() in pyspark.sql.pandas.types
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types so it seems possible to 
> convert from pyarrow null to pyspark null type and vice versa.
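A minimal sketch of the reported gap (assuming pyarrow is installed; on affected
versions each call below raises the unsupported-type error):

{code:python}
import pyarrow as pa
from pyspark.sql.pandas.types import from_arrow_type, to_arrow_type
from pyspark.sql.types import NullType

from_arrow_type(pa.null())   # raises: Unsupported type in conversion from Arrow: null
to_arrow_type(NullType())    # the reverse direction is likewise unsupported here
{code}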



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33613) [Python][Tests] Replace calls to deprecated test APIs

2020-11-30 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-33613:


 Summary: [Python][Tests] Replace calls to deprecated test APIs
 Key: SPARK-33613
 URL: https://issues.apache.org/jira/browse/SPARK-33613
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Tests
Affects Versions: 3.0.0
Reporter: Bryan Cutler


There are quite a few instances of deprecated Python and pandas APIs being used 
in the PySpark tests. These should be replaced by the preferred APIs, which have 
been available for some time. For example, in unittest: `assertRaisesRegexp()` -> 
`assertRaisesRegex()`.
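For instance, a before/after for one of the deprecated unittest aliases (the
test body is illustrative):

{code:python}
import unittest

class ExampleTest(unittest.TestCase):
    def test_error_message(self):
        # Deprecated alias: self.assertRaisesRegexp(ValueError, "invalid literal")
        # Preferred API:
        with self.assertRaisesRegex(ValueError, "invalid literal"):
            int("not a number")
{code}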



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type

2020-11-25 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238950#comment-17238950
 ] 

Bryan Cutler commented on SPARK-33489:
--

Yes, Arrow supports the null type. It should be pretty straightforward to add 
in Scala and Python. Is this something you are interested in working on, 
[~cactice]?

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Priority: Minor
>
> I got below error when using from_arrow_type() in pyspark.sql.pandas.types
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types so it seems possible to 
> convert from pyarrow null to pyspark null type and vice versa.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2020-11-20 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-21187.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

With MapType now added, all basic types are supported. I changed nested 
timestamps/dates to a separate issue and I think we can resolve this now.

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.1.0
>
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: -Struct-, -Array-, -Map-
>  * -*Decimal*-
>  * -*Binary*-
>  * -*Categorical*- when converting from Pandas
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2020-11-20 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Description: 
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: -Struct-, -Array-, -Map-
 * -*Decimal*-
 * -*Binary*-
 * -*Categorical*- when converting from Pandas

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * -Need to add some user docs-
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?

  was:
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported. '

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
 * -*Decimal*-
 * -*Binary*-
 * -*Categorical*- when converting from Pandas

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * -Need to add some user docs-
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support mulit-indexing?


> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: -Struct-, -Array-, -Map-
>  * -*Decimal*-
>  * -*Binary*-
>  * -*Categorical*- when converting from Pandas
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32285) Add PySpark support for nested timestamps with arrow

2020-11-20 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-32285:
-
Parent: (was: SPARK-21187)
Issue Type: Improvement  (was: Sub-task)

> Add PySpark support for nested timestamps with arrow
> 
>
> Key: SPARK-32285
> URL: https://issues.apache.org/jira/browse/SPARK-32285
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently with arrow optimizations, there is post-processing done in pandas 
> for timestamp columns to localize timezone. This is not done for nested 
> columns with timestamps such as StructType or ArrayType.
> Adding support for this is needed for Apache Arrow 1.0.0 upgrade due to use 
> of structs with timestamps in groupedby key over a window.
> As a simple first step, timestamps with 1 level nesting could be done first 
> and this will satisfy the immediate need.
> NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing 
> with pyarrow.array.cast, which could be easier done than in pandas.
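A hedged sketch of that pyarrow cast idea (assuming pyarrow >= 1.0.0; the
timezone string is illustrative, and matching Spark's session-timezone semantics
would still need to be verified):

{code:python}
import datetime
import pyarrow as pa

# Timezone-naive timestamps, as produced by the Arrow conversion...
arr = pa.array([datetime.datetime(2018, 3, 15, 0, 0)], type=pa.timestamp("us"))

# ...localized by casting to a timezone-aware timestamp type, instead of
# post-processing in pandas.
localized = arr.cast(pa.timestamp("us", tz="America/Los_Angeles"))
{code}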



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency

2020-11-01 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224420#comment-17224420
 ] 

Bryan Cutler edited comment on SPARK-33279 at 11/2/20, 5:21 AM:


[~fan_li_ya] we should change the Arrow-Spark integration tests so that they 
don't try to build with the latest Arrow Java, and instead just test the 
latest pyarrow, which should work. I made ARROW-10457 for this.


was (Author: bryanc):
[~fan_li_ya] we should change the Arrow-Spark integration tests so that it 
doesn't try to build with the latest Arrow Java, and instead just test the 
latest pyarrow, which should work.

> Spark 3.0 failure due to lack of Arrow dependency
> -
>
> Key: SPARK-33279
> URL: https://issues.apache.org/jira/browse/SPARK-33279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> A recent change in Arrow has split the arrow-memory module into 3, so client 
> code must add a dependency on arrow-memory-netty (or arrow-memory-unsafe).
> This has been done in the master branch of Spark, but not in the branch-3.0 
> branch, which is causing the build in branch-3.0 to fail 
> (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency

2020-11-01 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224420#comment-17224420
 ] 

Bryan Cutler commented on SPARK-33279:
--

[~fan_li_ya] we should change the Arrow-Spark integration tests so that they 
don't try to build with the latest Arrow Java, and instead just test the 
latest pyarrow, which should work.

> Spark 3.0 failure due to lack of Arrow dependency
> -
>
> Key: SPARK-33279
> URL: https://issues.apache.org/jira/browse/SPARK-33279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> A recent change in Arrow has split the arrow-memory module into 3, so client 
> code must add a dependency on arrow-memory-netty (or arrow-memory-unsafe).
> This has been done in the master branch of Spark, but not in the branch-3.0 
> branch, which is causing the build in branch-3.0 to fail 
> (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33213) Upgrade Apache Arrow to 2.0.0

2020-10-23 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219840#comment-17219840
 ] 

Bryan Cutler commented on SPARK-33213:
--

Just a couple of notes:

The library and format versions are now split; the format version is still at 
1.0.0, so it remains binary compatible. See here for more info: 
[https://arrow.apache.org/blog/2020/10/22/2.0.0-release/]

I don't think there are any relevant changes in Arrow Java between 1.0.1 and 
2.0.0, and pyspark is currently working with pyarrow 2.0.0 with the added env 
var {{PYARROW_IGNORE_TIMEZONE=1}}.
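A minimal sketch of setting that variable for a local driver process (for
executors, the assumption is it would be passed through something like
spark.executorEnv.PYARROW_IGNORE_TIMEZONE):

{code:python}
import os

# Must be set before pyarrow/pandas are imported by the Python workers.
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
{code}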

> Upgrade Apache Arrow to 2.0.0
> -
>
> Key: SPARK-33213
> URL: https://issues.apache.org/jira/browse/SPARK-33213
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Minor
>
> Apache Arrow 2.0.0 has [just been 
> released|https://cwiki.apache.org/confluence/display/ARROW/Arrow+2.0.0+Release].
>  This proposes to upgrade Spark's Arrow dependency to use 2.0.0, from the 
> current 1.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33189) Support PyArrow 2.0.0+

2020-10-20 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217779#comment-17217779
 ] 

Bryan Cutler commented on SPARK-33189:
--

There is an env var we can set that will use the old behavior and fix this. I 
can do a PR for it soon, unless someone else is able to get to it first.

> Support PyArrow 2.0.0+
> --
>
> Key: SPARK-33189
> URL: https://issues.apache.org/jira/browse/SPARK-33189
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Some tests fail with PyArrow 2.0.0 in PySpark:
> {code}
> ==
> ERROR [0.774s]: test_grouped_over_window_with_key 
> (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
> --
> Traceback (most recent call last):
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 595, in test_grouped_over_window_with_key
> .select('id', 'result').collect()
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in 
> collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
> 1305, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
> raise converted from None
> pyspark.sql.utils.PythonException: 
>   An exception was thrown from the Python worker. Please see the stack trace 
> below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, 
> in main
> process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, 
> in process
> serializer.dump_stream(out_iter, outfile)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 255, in dump_stream
> return ArrowStreamSerializer.dump_stream(self, 
> init_stream_yield_batches(), stream)
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 81, in dump_stream
> for batch in iterator:
>   File 
> "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
> line 248, in init_stream_yield_batches
> for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, 
> in mapper
> return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, 
> in 
> return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, 
> in wrapped
> result = f(key, pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in 
> wrapper
> return f(*args, **kwargs)
>   File 
> "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 
> 590, in f
> "{} != {}".format(expected_key[i][1], window_range)
> AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': 
> datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 
> 15, 0, 0, tzinfo=), 'end': datetime.datetime(2018, 3, 
> 20, 0, 0, tzinfo=)}
> {code}
> We should verify and support PyArrow 2.0.0+
> See also https://github.com/apache/spark/runs/1278918780



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33073) Improve error handling on Pandas to Arrow conversion failures

2020-10-05 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-33073:
-
Description: 
Currently, when converting from Pandas to Arrow for Pandas UDF return values or 
from createDataFrame(), PySpark will catch all ArrowExceptions and display info 
on how to disable the safe conversion config. This is displayed with the 
original error as a tuple:

{noformat}
('Exception thrown when converting pandas.Series (object) to Arrow Array 
(int32). It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
not convert a with type str: tried to convert to int'))
{noformat}

The problem is that this is meant mainly for things like float truncation or 
overflow, but it will also show up if the user has an invalid schema with types 
that are incompatible. The extra information is confusing in this case and the 
real error is buried.

This could be improved by only printing the extra info on how to disable safe 
checking if the config is actually set, and by using exception chaining to 
better show the original error. Also, any safe failure would be a ValueError, of 
which ArrowInvalid is a subclass, so the catch could be made narrower.

  was:
Currently, when converting from Pandas to Arrow for Pandas UDF return values or 
from createDataFrame(), PySpark will catch all ArrowExceptions and display info 
on how to disable the safe conversion config. This is displayed with the 
original error as a tuple:

{noformat}
('Exception thrown when converting pandas.Series (object) to Arrow Array 
(int32). It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
not convert a with type str: tried to convert to int'))
{noformat}

The problem is that this is meant mainly for thing like float truncation or 
overflow, but will also show if the user has an invalid schema with types that 
are incompatible. The extra information is confusing in this case and the real 
error is buried.

This could be improved by only printing the extra info on how to disable safe 
checking if the config is actually set. Also, any safe failures would be a 
ValueError, which ArrowInvaildError is a subclass, so the catch could be made 
more narrow.


> Improve error handling on Pandas to Arrow conversion failures
> -
>
> Key: SPARK-33073
> URL: https://issues.apache.org/jira/browse/SPARK-33073
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, when converting from Pandas to Arrow for Pandas UDF return values 
> or from createDataFrame(), PySpark will catch all ArrowExceptions and display 
> info on how to disable the safe conversion config. This is displayed with the 
> original error as a tuple:
> {noformat}
> ('Exception thrown when converting pandas.Series (object) to Arrow Array 
> (int32). It can be caused by overflows or other unsafe conversions warned by 
> Arrow. Arrow safe type check can be disabled by using SQL config 
> `spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
> not convert a with type str: tried to convert to int'))
> {noformat}
> The problem is that this is meant mainly for things like float truncation or 
> overflow, but it will also show up if the user has an invalid schema with types 
> that are incompatible. The extra information is confusing in this case and 
> the real error is buried.
> This could be improved by only printing the extra info on how to disable safe 
> checking if the config is actually set, and by using exception chaining to 
> better show the original error. Also, any safe failure would be a ValueError, 
> of which ArrowInvalid is a subclass, so the catch could be made narrower.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33073) Improve error handling on Pandas to Arrow conversion failures

2020-10-05 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-33073:


 Summary: Improve error handling on Pandas to Arrow conversion 
failures
 Key: SPARK-33073
 URL: https://issues.apache.org/jira/browse/SPARK-33073
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.1
Reporter: Bryan Cutler


Currently, when converting from Pandas to Arrow for Pandas UDF return values or 
from createDataFrame(), PySpark will catch all ArrowExceptions and display info 
on how to disable the safe conversion config. This is displayed with the 
original error as a tuple:

{noformat}
('Exception thrown when converting pandas.Series (object) to Arrow Array 
(int32). It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
not convert a with type str: tried to convert to int'))
{noformat}

The problem is that this is meant mainly for things like float truncation or 
overflow, but it will also show up if the user has an invalid schema with types 
that are incompatible. The extra information is confusing in this case and the 
real error is buried.

This could be improved by only printing the extra info on how to disable safe 
checking if the config is actually set. Also, any safe failure would be a 
ValueError, of which ArrowInvalid is a subclass, so the catch could be made 
narrower.
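A hedged sketch of the proposed handling (not the actual PySpark code; the
helper name, config plumbing, and message are illustrative):

{code:python}
import pyarrow as pa

def create_arrow_array(series, arrow_type, safe_check_enabled):
    try:
        return pa.Array.from_pandas(series, type=arrow_type, safe=safe_check_enabled)
    except ValueError as e:  # pyarrow.lib.ArrowInvalid is a ValueError subclass
        if safe_check_enabled:
            # Mention the config only when the safe check is actually on, and
            # chain the original error instead of folding it into a tuple.
            raise ValueError(
                "Arrow safe type check failed; it can be disabled with the SQL config "
                "spark.sql.execution.pandas.convertToArrowArraySafely") from e
        raise
{code}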



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24554) Add MapType Support for Arrow in PySpark

2020-10-01 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205719#comment-17205719
 ] 

Bryan Cutler commented on SPARK-24554:
--

I started working on this, but ran into an issue at 
https://issues.apache.org/jira/browse/ARROW-10151 which needs to be resolved 
first.

> Add MapType Support for Arrow in PySpark
> 
>
> Key: SPARK-24554
> URL: https://issues.apache.org/jira/browse/SPARK-24554
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> Add support for MapType in Arrow related classes in Scala/Java and pyarrow 
> functionality in Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-09-03 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190553#comment-17190553
 ] 

Bryan Cutler commented on SPARK-32312:
--

Sorry for the delay, I was holding off for a couple of things. Version 1.0.1 
was being released and I wanted to see if there was anything crucial that would 
make it worth using that version instead. I also wanted to see whether ARROW-9528 
would be included in the 1.0.1 release, which would have broken our Python tests, 
but it will be in v2.0.0. For Arrow Java, there were no real fixes between 1.0.0 
and 1.0.1, so it won't really make a difference, but we might as well use 1.0.1. 
For pyarrow, there were no fixes that affect our Spark usage, so it should be 
safe to use a minimum version of 1.0.0. I will try to put up a PR tomorrow.

> Upgrade Apache Arrow to 1.0.0
> -
>
> Key: SPARK-32312
> URL: https://issues.apache.org/jira/browse/SPARK-32312
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Apache Arrow will soon release v1.0.0 which provides backward/forward 
> compatibility guarantees as well as a number of fixes and improvements. This 
> will upgrade the Java artifact and PySpark API. Although PySpark will not 
> need special changes, it might be a good idea to bump up minimum supported 
> version and CI testing.
> TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries

2020-08-24 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-32686:


Assignee: Nicholas Chammas

> Un-deprecate inferring DataFrame schema from list of dictionaries
> -
>
> Key: SPARK-32686
> URL: https://issues.apache.org/jira/browse/SPARK-32686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>
> Inferring the schema of a DataFrame from a list of dictionaries feels natural 
> for PySpark users, and also mirrors [basic functionality in 
> Pandas|https://stackoverflow.com/a/20638258/877069].
> This is currently possible in PySpark but comes with a deprecation warning. 
> We should un-deprecate this behavior if there are no deeper reasons to 
> discourage users from this API, beyond wanting to push them to use {{Row}}.
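For reference, a minimal sketch of the usage in question, assuming an active
SparkSession (the records themselves are made up for illustration):

{code:python}
# Sketch: inferring a DataFrame schema from a list of dictionaries,
# mirroring how pandas builds a frame from records.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

records = [{"name": "alice", "age": 30}, {"name": "bob", "age": 25}]
df = spark.createDataFrame(records)  # schema inferred from the dict keys and values
df.printSchema()
df.show()
{code}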



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries

2020-08-24 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-32686.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29510
[https://github.com/apache/spark/pull/29510]

> Un-deprecate inferring DataFrame schema from list of dictionaries
> -
>
> Key: SPARK-32686
> URL: https://issues.apache.org/jira/browse/SPARK-32686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.1.0
>
>
> Inferring the schema of a DataFrame from a list of dictionaries feels natural 
> for PySpark users, and also mirrors [basic functionality in 
> Pandas|https://stackoverflow.com/a/20638258/877069].
> This is currently possible in PySpark but comes with a deprecation warning. 
> We should un-deprecate this behavior if there are no deeper reasons to 
> discourage users from this API, beyond wanting to push them to use {{Row}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-32413) Guidance for my project

2020-07-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler closed SPARK-32413.


> Guidance for my project 
> 
>
> Key: SPARK-32413
> URL: https://issues.apache.org/jira/browse/SPARK-32413
> Project: Spark
>  Issue Type: Brainstorming
>  Components: PySpark, Spark Core, SparkR
>Affects Versions: 3.0.0
>Reporter: Suat Toksoz
>Priority: Minor
>
> Hi,
> I am planning to read an Elasticsearch index continuously, put that data into 
> a DataFrame, group the data, search it, and create an alert. I would like to 
> write my code in Python.
> For this purpose, what should I use: Spark, Jupyter, PySpark...?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32413) Guidance for my project

2020-07-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-32413.
--
Resolution: Not A Problem

Hi [~stoksoz], this type of discussion is more appropriate for the mailing 
list; see [https://spark.apache.org/community.html] for how to subscribe.

> Guidance for my project 
> 
>
> Key: SPARK-32413
> URL: https://issues.apache.org/jira/browse/SPARK-32413
> Project: Spark
>  Issue Type: Brainstorming
>  Components: PySpark, Spark Core, SparkR
>Affects Versions: 3.0.0
>Reporter: Suat Toksoz
>Priority: Minor
>
> Hi,
> I am planning to read an Elasticsearch index continuously, put that data into 
> a DataFrame, group the data, search it, and create an alert. I would like to 
> write my code in Python.
> For this purpose, what should I use: Spark, Jupyter, PySpark...?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32300) toPandas with no partitions should work

2020-07-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-32300:


Assignee: Hyukjin Kwon

> toPandas with no partitions should work
> ---
>
> Key: SPARK-32300
> URL: https://issues.apache.org/jira/browse/SPARK-32300
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {code}
> >>> spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
>   An error occurred while calling o158.getResult.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>   at 
> org.apache.spark.api.python.PythonServer.getResult(PythonRDD.scala:874)
>   at 
> org.apache.spark.api.python.PythonServer.getResult(PythonRDD.scala:870)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287)
>   at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3369)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1.apply(Dataset.scala:3287)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1.apply(Dataset.scala:3286)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7$$anonfun$apply$3.apply$mcV$sp(PythonRDD.scala:456)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7$$anonfun$apply$3.apply(PythonRDD.scala:456)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7$$anonfun$apply$3.apply(PythonRDD.scala:456)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7.apply(PythonRDD.scala:457)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7.apply(PythonRDD.scala:453)
>   at 
> org.apache.spark.api.python.SocketFuncServer.handleConnection(PythonRDD.scala:994)
>   at 
> org.apache.spark.api.python.SocketFuncServer.handleConnection(PythonRDD.scala:988)
>   at 
> org.apache.spark.api.python.PythonServer$$anonfun$11$$anonfun$apply$9.apply(PythonRDD.scala:853)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.api.python.PythonServer$$anonfun$11.apply(PythonRDD.scala:853)
>   at 
> org.apache.spark.api.python.PythonServer$$anonfun$11.apply(PythonRDD.scala:852)
>   at 
> org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:908)
>   warnings.warn(msg)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/dataframe.py", 
> line 2132, in toPandas
> batches = self.toDF(*tmp_column_names)._collectAsArrow()
>   File 
> "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/dataframe.py", 
> line 2223, in _collectAsArrow
> jsocket_auth_server.getResult()  # Join serving thread and raise any 
> exceptions
>   File 
> "/Users/hyukjin.kwon/workspace/forked/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/hyukjin.kwon/workspace/forked/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJav

[jira] [Resolved] (SPARK-32300) toPandas with no partitions should work

2020-07-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-32300.
--
Fix Version/s: 2.4.7
   Resolution: Fixed

Issue resolved by pull request 29098
[https://github.com/apache/spark/pull/29098]

> toPandas with no partitions should work
> ---
>
> Key: SPARK-32300
> URL: https://issues.apache.org/jira/browse/SPARK-32300
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.7
>
>
> {code}
> >>> spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
>   An error occurred while calling o158.getResult.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>   at 
> org.apache.spark.api.python.PythonServer.getResult(PythonRDD.scala:874)
>   at 
> org.apache.spark.api.python.PythonServer.getResult(PythonRDD.scala:870)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287)
>   at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3369)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1.apply(Dataset.scala:3287)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1.apply(Dataset.scala:3286)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7$$anonfun$apply$3.apply$mcV$sp(PythonRDD.scala:456)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7$$anonfun$apply$3.apply(PythonRDD.scala:456)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7$$anonfun$apply$3.apply(PythonRDD.scala:456)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7.apply(PythonRDD.scala:457)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$7.apply(PythonRDD.scala:453)
>   at 
> org.apache.spark.api.python.SocketFuncServer.handleConnection(PythonRDD.scala:994)
>   at 
> org.apache.spark.api.python.SocketFuncServer.handleConnection(PythonRDD.scala:988)
>   at 
> org.apache.spark.api.python.PythonServer$$anonfun$11$$anonfun$apply$9.apply(PythonRDD.scala:853)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.api.python.PythonServer$$anonfun$11.apply(PythonRDD.scala:853)
>   at 
> org.apache.spark.api.python.PythonServer$$anonfun$11.apply(PythonRDD.scala:852)
>   at 
> org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:908)
>   warnings.warn(msg)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/dataframe.py", 
> line 2132, in toPandas
> batches = self.toDF(*tmp_column_names)._collectAsArrow()
>   File 
> "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/dataframe.py", 
> line 2223, in _collectAsArrow
> jsocket_auth_server.getResult()  # Join serving thread and raise any 
> exceptions
>   File 
> "/Users/hyukjin.kwon/workspace/forked/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/hyukjin.kw

[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-07-14 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157543#comment-17157543
 ] 

Bryan Cutler commented on SPARK-32312:
--

I've been doing local testing and will submit a WIP PR soon. The release should 
be available in ~1-2 weeks, depending on last minute issues.

> Upgrade Apache Arrow to 1.0.0
> -
>
> Key: SPARK-32312
> URL: https://issues.apache.org/jira/browse/SPARK-32312
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Apache Arrow will soon release v1.0.0 which provides backward/forward 
> compatibility guarantees as well as a number of fixes and improvements. This 
> will upgrade the Java artifact and PySpark API. Although PySpark will not 
> need special changes, it might be a good idea to bump up minimum supported 
> version and CI testing.
> TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-07-14 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-32312:


 Summary: Upgrade Apache Arrow to 1.0.0
 Key: SPARK-32312
 URL: https://issues.apache.org/jira/browse/SPARK-32312
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Bryan Cutler


Apache Arrow will soon release v1.0.0 which provides backward/forward 
compatibility guarantees as well as a number of fixes and improvements. This 
will upgrade the Java artifact and PySpark API. Although PySpark will not need 
special changes, it might be a good idea to bump up minimum supported version 
and CI testing.

TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2020-07-12 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Description: 
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
 * -*Decimal*-
 * -*Binary*-
 * -*Categorical*- when converting from Pandas

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * -Need to add some user docs-
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?

  was:
This is to track adding the remaining type support in Arrow Converters. 
Currently, only primitive data types are supported.

Remaining types:
 * -*Date*-
 * -*Timestamp*-
 * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
 * -*Decimal*-
 * -*Binary*-
* Categorical when converting from Pandas

Some things to do before closing this out:
 * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
 * -Need to add some user docs-
 * -Make sure Python tests are thorough-
 * Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?


> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> This is to track adding the remaining type support in Arrow Converters. 
> Currently, only primitive data types are supported.
> Remaining types:
>  * -*Date*-
>  * -*Timestamp*-
>  * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map
>  * -*Decimal*-
>  * -*Binary*-
>  * -*Categorical*- when converting from Pandas
> Some things to do before closing this out:
>  * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
>  * -Need to add some user docs-
>  * -Make sure Python tests are thorough-
>  * Check into complex type support mentioned in comments by [~leif], should 
> we support multi-indexing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32285) Add PySpark support for nested timestamps with arrow

2020-07-12 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-32285:


 Summary: Add PySpark support for nested timestamps with arrow
 Key: SPARK-32285
 URL: https://issues.apache.org/jira/browse/SPARK-32285
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Bryan Cutler


Currently with arrow optimizations, there is post-processing done in pandas for 
timestamp columns to localize the timezone. This is not done for nested columns 
with timestamps, such as StructType or ArrayType.

Adding support for this is needed for the Apache Arrow 1.0.0 upgrade due to the 
use of structs with timestamps in a grouped-by key over a window.

As a simple first step, timestamps with one level of nesting could be handled, 
which will satisfy the immediate need.

NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing with 
pyarrow.array.cast, which could be easier than doing it in pandas.
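A rough sketch of the cast idea mentioned in the note, assuming pyarrow 1.0.0 or
later (the value and the UTC timezone are illustrative only):

{code:python}
# Sketch only: attach a timezone to a timestamp array with Array.cast,
# rather than post-processing the column in pandas.
import pyarrow as pa

arr = pa.array([1_594_684_800_000_000], type=pa.timestamp("us"))  # timezone-naive microseconds
localized = arr.cast(pa.timestamp("us", tz="UTC"))                # timezone-aware result
print(localized.type)  # timestamp[us, tz=UTC]
{code}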



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32174) toPandas attempted Arrow optimization but has reached an error and can not continue

2020-07-08 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-32174.
--
Resolution: Not A Problem

Great, I will mark this as resolved then.  We should add the configuration 
example you used to the docs somewhere as well since I'm sure others will hit 
this.
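A hedged sketch of such a configuration; the exact deployment mechanism may
vary, and in some setups (e.g. spark-submit with an already-running JVM) the
driver option must instead go in spark-defaults.conf or --driver-java-options:

{code:python}
# Sketch: set the Netty reflection property for driver and executors when
# running on JDK 9+ with Arrow enabled.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)
{code}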

> toPandas attempted Arrow optimization but has reached an error and can not 
> continue
> ---
>
> Key: SPARK-32174
> URL: https://issues.apache.org/jira/browse/SPARK-32174
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, running in *stand-alone* mode
>Reporter: Ramin Hazegh
>Priority: Major
>
> h4. Converting a DataFrame to a pandas DataFrame using toPandas() fails.
>  
> *Spark 3.0.0 Running in stand-alone mode* using docker containers based on 
> jupyter docker stack here:
> [https://github.com/jupyter/docker-stacks/blob/master/pyspark-notebook/Dockerfile]
>  
> $ conda list | grep arrow
>  *arrow-cpp 0.17.1* py38h1234567_5_cpu conda-forge
>  *pyarrow 0.17.1* py38h1234567_5_cpu conda-forge
> $ conda list | grep pandas
>  *pandas 1.0.5* py38hcb8c335_0 conda-forge
>  
> *To reproduce:*
> {code:java}
> import numpy as np
> import pandas as pd
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("spark://10.0.1.40:7077") \
> .config("spark.sql.execution.arrow.enabled", "true") \
> .appName('test_arrow') \
> .getOrCreate()
> 
> # Generate a pandas DataFrame
> pdf = pd.DataFrame(np.random.rand(100, 3))
> # Create a Spark DataFrame from a pandas DataFrame using Arrow
> df = spark.createDataFrame(pdf)
> # Convert the Spark DataFrame back to a pandas DataFrame using Arrow
> result_pdf = df.select("*").toPandas()
> {code}
>  
> ==
> /usr/local/spark/python/pyspark/sql/pandas/conversion.py:134: UserWarning: 
> toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>  An error occurred while calling o55.getResult.
>  : org.apache.spark.SparkException: Exception thrown in awaitResult: 
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
>  at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:88)
>  at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:84)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>  at py4j.Gateway.invoke(Gateway.java:282)
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)
>  at py4j.GatewayConnection.run(GatewayConnection.java:238)
>  at java.base/java.lang.Thread.run(Thread.java:834)
>  Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 14 in stage 0.0 failed 4 times, most recent failure: Lost task 
> 14.3 in stage 0.0 (TID 31, 10.0.1.43, executor 0): 
> java.lang.UnsupportedOperationException: sun.misc.Unsafe or 
> java.nio.DirectByteBuffer.(long, int) not available
>  at 
> io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:490)
>  at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
>  at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
>  at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
>  at org.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:81)
>  at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessageBody(MessageSerializer.java:696)
>  at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:344)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$.loadBatch(ArrowConverters.scala:189)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:165)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.(ArrowConverters.scala:144)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConv

[jira] [Commented] (SPARK-32174) toPandas attempted Arrow optimization but has reached an error and can not continue

2020-07-07 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152958#comment-17152958
 ] 

Bryan Cutler commented on SPARK-32174:
--

From the stacktrace, it looks like you are using JDK 9 or above, in which case 
Arrow (really Netty) needs the JVM system property 
{{io.netty.tryReflectionSetAccessible}} set to true; see 
https://issues.apache.org/jira/browse/SPARK-29923 , which is also noted in the 
release notes. Could you confirm if this solves your issue?

> toPandas attempted Arrow optimization but has reached an error and can not 
> continue
> ---
>
> Key: SPARK-32174
> URL: https://issues.apache.org/jira/browse/SPARK-32174
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, running in *stand-alone* mode
>Reporter: Ramin Hazegh
>Priority: Major
>
> h4. Converting a DataFrame to a pandas DataFrame using toPandas() fails.
>  
> *Spark 3.0.0 Running in stand-alone mode* using docker containers based on 
> jupyter docker stack here:
> [https://github.com/jupyter/docker-stacks/blob/master/pyspark-notebook/Dockerfile]
>  
> $ conda list | grep arrow
>  *arrow-cpp 0.17.1* py38h1234567_5_cpu conda-forge
>  *pyarrow 0.17.1* py38h1234567_5_cpu conda-forge
> $ conda list | grep pandas
>  *pandas 1.0.5* py38hcb8c335_0 conda-forge
>  
> *To reproduce:*
> {code:java}
> import numpy as np
> import pandas as pd
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("spark://10.0.1.40:7077") \
> .config("spark.sql.execution.arrow.enabled", "true") \
> .appName('test_arrow') \
> .getOrCreate()
> 
> # Generate a pandas DataFrame
> pdf = pd.DataFrame(np.random.rand(100, 3))
> # Create a Spark DataFrame from a pandas DataFrame using Arrow
> df = spark.createDataFrame(pdf)
> # Convert the Spark DataFrame back to a pandas DataFrame using Arrow
> result_pdf = df.select("*").toPandas()
> {code}
>  
> ==
> /usr/local/spark/python/pyspark/sql/pandas/conversion.py:134: UserWarning: 
> toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>  An error occurred while calling o55.getResult.
>  : org.apache.spark.SparkException: Exception thrown in awaitResult: 
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
>  at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:88)
>  at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:84)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>  at py4j.Gateway.invoke(Gateway.java:282)
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)
>  at py4j.GatewayConnection.run(GatewayConnection.java:238)
>  at java.base/java.lang.Thread.run(Thread.java:834)
>  Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 14 in stage 0.0 failed 4 times, most recent failure: Lost task 
> 14.3 in stage 0.0 (TID 31, 10.0.1.43, executor 0): 
> java.lang.UnsupportedOperationException: sun.misc.Unsafe or 
> java.nio.DirectByteBuffer.(long, int) not available
>  at 
> io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:490)
>  at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
>  at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
>  at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
>  at org.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:81)
>  at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessageBody(MessageSerializer.java:696)
>  at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:344)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$.loadBatch(ArrowConverters.scala:189)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:165)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.(ArrowConverter

[jira] [Created] (SPARK-32162) Improve Pandas Grouped Map with Window test output

2020-07-02 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-32162:


 Summary: Improve Pandas Grouped Map with Window test output
 Key: SPARK-32162
 URL: https://issues.apache.org/jira/browse/SPARK-32162
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Tests
Affects Versions: 3.0.0
Reporter: Bryan Cutler


The output of GroupedMapInPandasTests.test_grouped_over_window_with_key is not 
helpful; it only gives:

{code}

==
FAIL: test_grouped_over_window_with_key 
(pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
--
Traceback (most recent call last):
  File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, 
in test_grouped_over_window_with_key
self.assertTrue(all([r[0] for r in result]))
AssertionError: False is not true

--
Ran 21 tests in 141.194s

FAILED (failures=1)
{code}
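One possible direction, sketched as a self-contained unittest. This assumes the
test collects a per-row flag alongside some context values; it is only an
illustration, not the change that was actually merged:

{code:python}
# Sketch: surface the offending rows in the failure message instead of
# reducing everything to a bare True/False assertion.
import unittest

class WindowKeyCheckExample(unittest.TestCase):
    def test_rows_report_failures(self):
        # Stand-in for the (flag, id, expected, actual) tuples the real test builds.
        result = [(True, 1, 1.0, 1.0), (True, 2, 3.0, 3.0)]
        failed = [r for r in result if not r[0]]
        # A failing row would be printed with its contents in the message below.
        self.assertEqual(failed, [], msg="rows failing the window-key check: %s" % failed)

if __name__ == "__main__":
    unittest.main()
{code}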



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32098) Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

2020-06-25 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-32098:


Assignee: Hyukjin Kwon

> Use iloc for positional slicing instead of direct slicing in createDataFrame 
> with Arrow
> ---
>
> Key: SPARK-32098
> URL: https://issues.apache.org/jira/browse/SPARK-32098
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
>
> When you use floats as the index of a pandas DataFrame, it produces wrong results:
> {code}
> >>> import pandas as pd
> >>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 
> >>> 4.])).show()
> +---+
> |  a|
> +---+
> |  1|
> |  1|
> |  2|
> +---+
> {code}
> This is because direct slicing uses the value as index when the index 
> contains floats:
> {code}
> >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
>  a
> 2.0  1
> 3.0  2
> 4.0  3
> >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
>  a
> 4.0  3
> >>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
>a
> 4  3
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32098) Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

2020-06-25 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-32098.
--
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28928
[https://github.com/apache/spark/pull/28928]

> Use iloc for positional slicing instead of direct slicing in createDataFrame 
> with Arrow
> ---
>
> Key: SPARK-32098
> URL: https://issues.apache.org/jira/browse/SPARK-32098
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> When you use floats as the index of a pandas DataFrame, it produces wrong results:
> {code}
> >>> import pandas as pd
> >>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 
> >>> 4.])).show()
> +---+
> |  a|
> +---+
> |  1|
> |  1|
> |  2|
> +---+
> {code}
> This is because direct slicing uses the value as index when the index 
> contains floats:
> {code}
> >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
>  a
> 2.0  1
> 3.0  2
> 4.0  3
> >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
>  a
> 4.0  3
> >>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
>a
> 4  3
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31998) Change package references for ArrowBuf

2020-06-24 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-31998:
-
Component/s: (was: Spark Core)
 SQL

> Change package references for ArrowBuf
> --
>
> Key: SPARK-31998
> URL: https://issues.apache.org/jira/browse/SPARK-31998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Recently, we have moved the class ArrowBuf from the package io.netty.buffer to 
> org.apache.arrow.memory. So after upgrading the Arrow library, we need to update 
> the references to ArrowBuf to use the correct package name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31998) Change package references for ArrowBuf

2020-06-24 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-31998:
-
Issue Type: Improvement  (was: Bug)

> Change package references for ArrowBuf
> --
>
> Key: SPARK-31998
> URL: https://issues.apache.org/jira/browse/SPARK-31998
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Recently, we have moved the class ArrowBuf from the package io.netty.buffer to 
> org.apache.arrow.memory. So after upgrading the Arrow library, we need to update 
> the references to ArrowBuf to use the correct package name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor

2020-06-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-32080:
-
Priority: Trivial  (was: Major)

> Simplify ArrowColumnVector ListArray accessor
> -
>
> Key: SPARK-32080
> URL: https://issues.apache.org/jira/browse/SPARK-32080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Trivial
>
> The ArrowColumnVector ListArray accessor calculates start and end offset 
> indices manually. There were APIs added in Arrow 0.15.0 that do this and 
> using them will simplify this code and make use of more stable APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor

2020-06-23 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-32080:


 Summary: Simplify ArrowColumnVector ListArray accessor
 Key: SPARK-32080
 URL: https://issues.apache.org/jira/browse/SPARK-32080
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Bryan Cutler


The ArrowColumnVector ListArray accessor calculates start and end offset 
indices manually. There were APIs added in Arrow 0.15.0 that do this and using 
them will simplify this code and make use of more stable APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion

2020-06-10 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-31964:


 Summary: Avoid Pandas import for CategoricalDtype with Arrow 
conversion
 Key: SPARK-31964
 URL: https://issues.apache.org/jira/browse/SPARK-31964
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Bryan Cutler


The import for CategoricalDtype changed in Pandas between 0.23 and 1.0, and 
pyspark currently checks 2 places for the import. It would be better to check 
the type as a string and avoid any imports.
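A minimal sketch of the string-based check; the helper name is an assumption
for illustration, not the actual patch:

{code:python}
# Sketch: detect a pandas CategoricalDtype by its type name so this module
# never needs to import pandas (or chase the import path across versions).
def _is_categorical_dtype(dtype):
    return type(dtype).__name__ == "CategoricalDtype"

# Usage (pandas imported only here, for the illustration):
# import pandas as pd
# _is_categorical_dtype(pd.Series(["a", "b"]).astype("category").dtype)  # True
{code}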



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31915) Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs

2020-06-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-31915.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28777
[https://github.com/apache/spark/pull/28777]

> Resolve the grouping column properly per the case sensitivity in grouped and 
> cogrouped pandas UDFs
> --
>
> Key: SPARK-31915
> URL: https://issues.apache.org/jira/browse/SPARK-31915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> {code}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([[1, 1]], ["column", "Score"])
> @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
> def my_pandas_udf(pdf):
> return pdf.assign(Score=0.5)
> df.groupby('COLUMN').apply(my_pandas_udf).show()
> {code}
> {code}
> pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could 
> be: COLUMN, COLUMN.;
> {code}
> {code}
> df1 = spark.createDataFrame([(1, 1)], ("column", "value"))
> df2 = spark.createDataFrame([(1, 1)], ("column", "value"))
> df1.groupby("COLUMN").cogroup(
> df2.groupby("COLUMN")
> ).applyInPandas(lambda r, l: r + l, df1.schema).show()
> {code}
> {code}
> pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input 
> columns: [COLUMN, COLUMN, value, value];;
> 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, 
> column#13L, value#14L), [column#22L, value#23L]
> :- Project [COLUMN#9L, column#9L, value#10L]
> :  +- LogicalRDD [column#9L, value#10L], false
> +- Project [COLUMN#13L, column#13L, value#14L]
>+- LogicalRDD [column#13L, value#14L], false
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31915) Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs

2020-06-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-31915:


Assignee: Hyukjin Kwon

> Resolve the grouping column properly per the case sensitivity in grouped and 
> cogrouped pandas UDFs
> --
>
> Key: SPARK-31915
> URL: https://issues.apache.org/jira/browse/SPARK-31915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {code}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([[1, 1]], ["column", "Score"])
> @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
> def my_pandas_udf(pdf):
> return pdf.assign(Score=0.5)
> df.groupby('COLUMN').apply(my_pandas_udf).show()
> {code}
> {code}
> pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could 
> be: COLUMN, COLUMN.;
> {code}
> {code}
> df1 = spark.createDataFrame([(1, 1)], ("column", "value"))
> df2 = spark.createDataFrame([(1, 1)], ("column", "value"))
> df1.groupby("COLUMN").cogroup(
> df2.groupby("COLUMN")
> ).applyInPandas(lambda r, l: r + l, df1.schema).show()
> {code}
> {code}
> pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input 
> columns: [COLUMN, COLUMN, value, value];;
> 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, 
> column#13L, value#14L), [column#22L, value#23L]
> :- Project [COLUMN#9L, column#9L, value#10L]
> :  +- LogicalRDD [column#9L, value#10L], false
> +- Project [COLUMN#13L, column#13L, value#14L]
>+- LogicalRDD [column#13L, value#14L], false
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2020-05-27 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25351.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 26585
[https://github.com/apache/spark/pull/26585]

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Assignee: Jalpan Randeri
>Priority: Major
>  Labels: bulk-closed
> Fix For: 3.1.0
>
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}
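Before the fix, one user-side workaround sketch (an assumption for illustration,
not the change merged in the PR) is to materialize the categories before handing
the frame to Spark:

{code:python}
# Sketch: convert the categorical column back to its underlying values so
# Arrow sees a plain object/string column instead of a dictionary type.
import pandas as pd

pdf = pd.DataFrame({"A": ["a", "b", "c", "a"]})
pdf["B"] = pdf["A"].astype("category")
pdf["B"] = pdf["B"].astype(pdf["B"].cat.categories.dtype)  # back to object dtype
# spark.createDataFrame(pdf) should then avoid the dictionary-type error (pre-3.1).
{code}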



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2020-05-27 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-25351:


Assignee: Jalpan Randeri

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Assignee: Jalpan Randeri
>Priority: Major
>  Labels: bulk-closed
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31704) PandasUDFType.GROUPED_AGG with Java 11

2020-05-13 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106500#comment-17106500
 ] 

Bryan Cutler commented on SPARK-31704:
--

This is due to a Netty API that Arrow uses; unfortunately, it currently 
needs the following Java option set in order to work: 
{{-Dio.netty.tryReflectionSetAccessible=true}}. See 
https://issues.apache.org/jira/browse/SPARK-29924 which added documentation for 
this here https://github.com/apache/spark/blob/master/docs/index.md#downloading.

> PandasUDFType.GROUPED_AGG with Java 11
> --
>
> Key: SPARK-31704
> URL: https://issues.apache.org/jira/browse/SPARK-31704
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: java jdk: 11
> python: 3.7
>  
>Reporter: Markus Tretzmüller
>Priority: Minor
>  Labels: newbie
>
> Running the example from the 
> [docs|https://spark.apache.org/docs/3.0.0-preview2/api/python/pyspark.sql.html#module-pyspark.sql.functions]
>  gives an error with java 11. It works with java 8.
> {code:python}
> import findspark
> findspark.init('/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7')
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql import Window
> from pyspark.sql import SparkSession
> if __name__ == '__main__':
> spark = SparkSession \
> .builder \
> .appName('test') \
> .getOrCreate()
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf("double", PandasUDFType.GROUPED_AGG)
> def mean_udf(v):
> return v.mean()
> w = (Window.partitionBy('id')
>  .orderBy('v')
>  .rowsBetween(-1, 0))
> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
> {code}
> {noformat}
> File 
> "/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o81.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 
> in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 
> (TID 37, 131.130.32.15, executor driver): 
> java.lang.UnsupportedOperationException: sun.misc.Unsafe or 
> java.nio.DirectByteBuffer.(long, int) not available
>   at 
> io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
>   at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
>   at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
>   at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
>   at 
> org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240)
>   at 
> org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:373)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:213)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31629) "py4j.protocol.Py4JJavaError: An error occurred while calling o90.save" in pyspark 2.3.1

2020-05-05 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100154#comment-17100154
 ] 

Bryan Cutler commented on SPARK-31629:
--

[~appleyuchi] are you able to try out a more recent version of Spark?

> "py4j.protocol.Py4JJavaError: An error occurred while calling o90.save" in 
> pyspark 2.3.1
> 
>
> Key: SPARK-31629
> URL: https://issues.apache.org/jira/browse/SPARK-31629
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Ubuntu19.10
>  
> anaconda3-python3.6.10
>  
> scala 2.11.8
>  
> apache-hive-3.0.0-bin
>  
> hadoop-2.7.7
>  
> spark-2.3.1-bin-hadoop2.7
>  
> java version "1.8.0_131"
>  
> Mysql Server version: 8.0.19-0ubuntu0.19.10.3 (Ubuntu)
>  
> driver:mysql-connector-java-8.0.20.jar
>  
> [Driver 
> link|https://mvnrepository.com/artifact/mysql/mysql-connector-java/8.0.20]
>  
>Reporter: appleyuchi
>Priority: Critical
> Fix For: 1.4.1, 1.5.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I have searched the forum:
> SPARK-8365
> mentioned the same issue in Spark 1.4.0, and
> SPARK-8368
> fixed it in Spark 1.4.1 / 1.5.0.
>  
> However, in Spark 2.3.1 this bug occurs again.
> Please help me, thanks!
> #--
> test.py
> [https://paste.ubuntu.com/p/HJfbcQ2zq3/]
>  
> The running methods are:
> ① spark-submit --master yarn --deploy-mode cluster test.py
> ② pyspark --master yarn (and then paste the code above)
> Each method can replicate this error.
>  
> then I got:
> Traceback (most recent call last):
>  File "test.py", line 45, in 
>  password="appleyuchi").mode('append').save()
>  File 
> "/home/appleyuchi/bigdata/hadoop_tmp/nm-local-dir/usercache/appleyuchi/appcache/application_1588504345289_0003/container_1588504345289_0003_01_01/pyspark.zip/pyspark/sql/readwriter.py",
>  line 703, in save
>  File 
> "/home/appleyuchi/bigdata/hadoop_tmp/nm-local-dir/usercache/appleyuchi/appcache/application_1588504345289_0003/container_1588504345289_0003_01_01/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>  File 
> "/home/appleyuchi/bigdata/hadoop_tmp/nm-local-dir/usercache/appleyuchi/appcache/application_1588504345289_0003/container_1588504345289_0003_01_01/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>  File 
> "/home/appleyuchi/bigdata/hadoop_tmp/nm-local-dir/usercache/appleyuchi/appcache/application_1588504345289_0003/container_1588504345289_0003_01_01/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
>  {color:#FF}*py4j.protocol.Py4JJavaError: An error occurred while calling 
> o90.save.*{color}
>  : java.sql.SQLSyntaxErrorException: Unknown database 
> 'leaf'*{color:#FF}(I'm sure this database exists){color}*
>  at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
>  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
>  at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
>  at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:836)
>  at com.mysql.cj.jdbc.ConnectionImpl.(ConnectionImpl.java:456)
>  at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:246)
>  at 
> com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:197)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.execute

[jira] [Assigned] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0

2020-04-13 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-31306:


Assignee: Ben

> rand() function documentation suggests an inclusive upper bound of 1.0
> --
>
> Key: SPARK-31306
> URL: https://issues.apache.org/jira/browse/SPARK-31306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Ben
>Assignee: Ben
>Priority: Major
>
>  The rand() function in PySpark, Spark, and R is documented as drawing from 
> U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing 
> (i.e. for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing 
> `U[0.0, 1.0]` suggests the value returned could include 1.0). The function 
> itself uses Rand(), which is [documented |#L71] as having a result in the 
> range [0, 1).
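A minimal PySpark sketch of the documented behavior, i.e. values in [0.0, 1.0) (a local session and the app name are illustrative assumptions):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").appName("rand-bounds").getOrCreate()

# Sample many draws; every value should be >= 0.0 and strictly < 1.0.
df = spark.range(100000).select(F.rand(seed=42).alias("r"))
stats = df.agg(F.min("r").alias("min_r"), F.max("r").alias("max_r")).collect()[0]
assert 0.0 <= stats["min_r"] and stats["max_r"] < 1.0
{code}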



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0

2020-04-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-31306:


Assignee: Bryan Cutler

> rand() function documentation suggests an inclusive upper bound of 1.0
> --
>
> Key: SPARK-31306
> URL: https://issues.apache.org/jira/browse/SPARK-31306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Ben
>Assignee: Bryan Cutler
>Priority: Major
>
>  The rand() function in PySpark, Spark, and R is documented as drawing from 
> U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing 
> (i.e. for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing 
> `U[0.0, 1.0]` suggests the value returned could include 1.0). The function 
> itself uses Rand(), which is [documented |#L71] as having a result in the 
> range [0, 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0

2020-04-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-31306:


Assignee: (was: Bryan Cutler)

> rand() function documentation suggests an inclusive upper bound of 1.0
> --
>
> Key: SPARK-31306
> URL: https://issues.apache.org/jira/browse/SPARK-31306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Ben
>Priority: Major
>
>  The rand() function in PySpark, Spark, and R is documented as drawing from 
> U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing 
> (i.e. for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing 
> `U[0.0, 1.0]` suggests the value returned could include 1.0). The function 
> itself uses Rand(), which is [documented |#L71] as having a result in the 
> range [0, 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0

2020-04-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-31306.
--
Resolution: Fixed

Issue resolved by pull request 28071
https://github.com/apache/spark/pull/28071

> rand() function documentation suggests an inclusive upper bound of 1.0
> --
>
> Key: SPARK-31306
> URL: https://issues.apache.org/jira/browse/SPARK-31306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Ben
>Priority: Major
>
>  The rand() function in PySpark, Spark, and R is documented as drawing from 
> U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing 
> (i.e. for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing 
> `U[0.0, 1.0]` suggests the value returned could include 1.0). The function 
> itself uses Rand(), which is [documented |#L71] as having a result in the 
> range [0, 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31299) Pyspark.ml.clustering illegalArgumentException with dataframe created from rows

2020-04-01 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-31299:
-
Description: 
I hope this is the right place and way to report a bug in (at least) the 
PySpark API:

BisectingKMeans in the following example is only exemplary, the error occurs 
with all clustering algorithms:
{code:python}
from pyspark.sql import Row
from pyspark.mllib.linalg import DenseVector
from pyspark.ml.clustering import BisectingKMeans

data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 200.0, 
1.0, 1.0, 1.0, 0.0, 3.0])),
 Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
 Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
 Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
 Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])

kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model = kmeans.fit(data)
{code}
The .fit-call in the last line will fail with the following error:
{code:java}
Py4JJavaError: An error occurred while calling o51.fit.
: java.lang.IllegalArgumentException: requirement failed: Column test_features 
must be of type equal to one of the following types: 
[struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, 
array<double>, array<float>] but was actually of type 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
{code}
As can be seen, the data type reported to be passed to the function is the 
first data type in the list of allowed data types, yet the call ends in an 
error because of it.

See my [StackOverflow 
issue|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml]
 for more context

  was:
I hope this is the right place and way to report a bug in (at least) the 
PySpark API:

BisectingKMeans in the following example is only exemplary, the error occurs 
with all clustering algorithms:
{code:java}
from pyspark.sql import Row
from pyspark.mllib.linalg import DenseVector
from pyspark.ml.clustering import BisectingKMeansdata = 
spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 
1.0, 1.0, 0.0, 3.0])),
 Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
 Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
 Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
 Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])

kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model = kmeans.fit(data)
{code}
The .fit-call in the last line will fail with the following error:
{code:java}
Py4JJavaError: An error occurred while calling o51.fit.
: java.lang.IllegalArgumentException: requirement failed: Column test_features 
must be of type equal to one of the following types: 
[struct,values:array>, 
array, array] but was actually of type 
struct,values:array>.
{code}
As can be seen, the data type reported to be passed to the function is the 
first data type in the list of allowed data types, yet the call ends in an 
error because of it.

See my [StackOverflow 
issue|[https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml]]
 for more context


> Pyspark.ml.clustering illegalArgumentException with dataframe created from 
> rows
> ---
>
> Key: SPARK-31299
> URL: https://issues.apache.org/jira/browse/SPARK-31299
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Lukas Thaler
>Priority: Major
>
> I hope this is the right place and way to report a bug in (at least) the 
> PySpark API:
> BisectingKMeans in the following example is only exemplary, the error occurs 
> with all clustering algorithms:
> {code:python}
> from pyspark.sql import Row
> from pyspark.mllib.linalg import DenseVector
> from pyspark.ml.clustering import BisectingKMeans
> data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 
> 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
>  Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
>  Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
> kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
> model = kmeans.fit(data)
> {code}
> The .fit-call in the last line will fail with the following error:
> {code:java}
> Py4JJavaError: An error occurred while calling o51.fit.
> : java.lang.IllegalArgumentException: requirement failed: Column 
> test_features must be of type equal to one of the following types: 
> [

[jira] [Commented] (SPARK-31299) Pyspark.ml.clustering illegalArgumentException with dataframe created from rows

2020-04-01 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073027#comment-17073027
 ] 

Bryan Cutler commented on SPARK-31299:
--

It looks like you are using {{DenseVector}} from {{pyspark.mllib}} instead of 
{{pyspark.ml}}. It's confusing that there are 2, but they are different. Try 
importing and defining your vectors like this:

{code}
from pyspark.ml.linalg import Vectors
test_features=Vectors.dense([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])
{code}

make your DataFrame with rows of these features and it should work. I'll 
resolve this, but let me know if you have further issues.
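Putting that together with the reporter's data, a minimal end-to-end sketch (a local SparkSession and the app name are assumptions):

{code:python}
from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import BisectingKMeans

spark = SparkSession.builder.master("local[1]").appName("bkm-repro").getOrCreate()

# Same feature vectors as in the report, but built with pyspark.ml vectors.
data = spark.createDataFrame([
    Row(test_features=Vectors.dense([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
    Row(test_features=Vectors.dense([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=Vectors.dense([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=Vectors.dense([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
    Row(test_features=Vectors.dense([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0])),
])

kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model = kmeans.fit(data)  # fits once the column is a pyspark.ml VectorUDT
{code}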


> Pyspark.ml.clustering illegalArgumentException with dataframe created from 
> rows
> ---
>
> Key: SPARK-31299
> URL: https://issues.apache.org/jira/browse/SPARK-31299
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Lukas Thaler
>Priority: Major
>
> I hope this is the right place and way to report a bug in (at least) the 
> PySpark API:
> BisectingKMeans in the following example is only exemplary, the error occurs 
> with all clustering algorithms:
> {code:java}
> from pyspark.sql import Row
> from pyspark.mllib.linalg import DenseVector
> from pyspark.ml.clustering import BisectingKMeans
> data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 
> 1.0, 1.0, 0.0, 3.0])),
>  Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
>  Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
> kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
> model = kmeans.fit(data)
> {code}
> The .fit-call in the last line will fail with the following error:
> {code:java}
> Py4JJavaError: An error occurred while calling o51.fit.
> : java.lang.IllegalArgumentException: requirement failed: Column 
> test_features must be of type equal to one of the following types: 
> [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, 
> array<double>, array<float>] but was actually of type 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
> {code}
> As can be seen, the data type reported to be passed to the function is the 
> first data type in the list of allowed data types, yet the call ends in an 
> error because of it.
> See my [StackOverflow 
> issue|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml]
>  for more context



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31299) Pyspark.ml.clustering illegalArgumentException with dataframe created from rows

2020-04-01 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-31299.
--
Resolution: Not A Problem

> Pyspark.ml.clustering illegalArgumentException with dataframe created from 
> rows
> ---
>
> Key: SPARK-31299
> URL: https://issues.apache.org/jira/browse/SPARK-31299
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Lukas Thaler
>Priority: Major
>
> I hope this is the right place and way to report a bug in (at least) the 
> PySpark API:
> BisectingKMeans in the following example is only exemplary, the error occurs 
> with all clustering algorithms:
> {code:java}
> from pyspark.sql import Row
> from pyspark.mllib.linalg import DenseVector
> from pyspark.ml.clustering import BisectingKMeans
> data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 
> 1.0, 1.0, 0.0, 3.0])),
>  Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
>  Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
> kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
> model = kmeans.fit(data)
> {code}
> The .fit-call in the last line will fail with the following error:
> {code:java}
> Py4JJavaError: An error occurred while calling o51.fit.
> : java.lang.IllegalArgumentException: requirement failed: Column 
> test_features must be of type equal to one of the following types: 
> [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, 
> array<double>, array<float>] but was actually of type 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
> {code}
> As can be seen, the data type reported to be passed to the function is the 
> first data type in the list of allowed data types, yet the call ends in an 
> error because of it.
> See my [StackOverflow 
> issue|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml]
>  for more context



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-03-06 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053717#comment-17053717
 ] 

Bryan Cutler commented on SPARK-30961:
--

Just to be clear, this is only an issue with Spark 2.4.x. The issue does not 
affect Spark 3.0.0 and above.

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas()
> {code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:python}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas
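For reference, a standalone pandas sketch of the conversion the proposed fix above relies on (illustrative only, not the actual Spark code path):

{code:python}
import datetime
import pandas as pd

# An object-dtype series of dates, as Arrow can hand back for a DateType column.
series = pd.Series([datetime.date(2019, 12, 6), datetime.date(2020, 1, 15)], dtype="object")

converted = pd.to_datetime(series).dt.date  # back to plain datetime.date values
assert all(isinstance(v, datetime.date) for v in converted)
{code}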



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-03-06 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-30961.
--
Resolution: Won't Fix

Thanks [~KevinAppel] and [~nicornk] for the info, I'll go ahead and close this 
then.

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas()
> {code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:python}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-02-27 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046801#comment-17046801
 ] 

Bryan Cutler commented on SPARK-30961:
--

Yes, we should be able to keep Spark 3.x up to date with the latest pyarrow. It 
is currently being tested against 0.15.1 and I've tested manually with 0.16.0 
also.

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas()
> {code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:python}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-02-26 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045961#comment-17045961
 ] 

Bryan Cutler commented on SPARK-30961:
--

[~nicornk] there were a number of fixes related to Arrow that went into the 
master branch for 3.0.0 and not branch-2.4, notably SPARK-26887 and SPARK-26566 
for the date issue. The latter was an upgrade of Arrow, and it is not the usual 
policy to backport upgrades. I would recommend using an older version of 
pyarrow with Spark, version 0.8.0 would be best, but you might be able to use 
0.11.1 without issues.

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas()
> {code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:python}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30861) Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark

2020-02-20 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-30861:
-
Fix Version/s: 3.0.0

> Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark
> 
>
> Key: SPARK-30861
> URL: https://issues.apache.org/jira/browse/SPARK-30861
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Those were removed as of SPARK-25908. We should deprecate them.
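For context, a minimal sketch of the entry points involved (illustrative; the exact deprecation mechanics live in the linked PR):

{code:python}
from pyspark.sql import SparkSession, SQLContext

# Recommended: a single SparkSession entry point.
spark = SparkSession.builder.master("local[1]").appName("example").getOrCreate()

# Patterns this ticket deprecates: constructing SQLContext directly
# or via SQLContext.getOrCreate; both should now emit a deprecation warning.
sqlContext = SQLContext(spark.sparkContext)               # deprecated constructor
sqlContext2 = SQLContext.getOrCreate(spark.sparkContext)  # deprecated classmethod
{code}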



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30861) Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark

2020-02-20 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-30861.
--
Resolution: Fixed

> Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark
> 
>
> Key: SPARK-30861
> URL: https://issues.apache.org/jira/browse/SPARK-30861
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Those were removed as of SPARK-25908. We should deprecate them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30861) Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark

2020-02-20 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041268#comment-17041268
 ] 

Bryan Cutler commented on SPARK-30861:
--

Issue resolved by pull request 27614
https://github.com/apache/spark/pull/27614

> Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark
> 
>
> Key: SPARK-30861
> URL: https://issues.apache.org/jira/browse/SPARK-30861
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Those were removed as of SPARK-25908. We should deprecate them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30861) Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark

2020-02-20 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-30861:


Assignee: Hyukjin Kwon

> Deprecate constructor of SQLContext and getOrCreate in SQLContext at PySpark
> 
>
> Key: SPARK-30861
> URL: https://issues.apache.org/jira/browse/SPARK-30861
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Those were removed as of SPARK-25908. We should deprecate them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30834) Add note for recommended versions of Pandas and PyArrow for 2.4.x

2020-02-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-30834:
-
Component/s: PySpark

> Add note for recommended versions of Pandas and PyArrow for 2.4.x
> -
>
> Key: SPARK-30834
> URL: https://issues.apache.org/jira/browse/SPARK-30834
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.4.5
>Reporter: Bryan Cutler
>Priority: Major
>
> CI testing for branch 2.4 has been with the versions below. These are 
> recommended and newer versions cannot be guaranteed to work correctly.
>  
> for 2.4, python 3.6.8:
> -bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
> pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
>  pandas: 0.19.2
>  pyarrow: 0.8.0
> for master/3.0, python 3.6.8:
> -bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
> pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
>  pandas: 0.24.2
>  pyarrow: 0.15.1
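As a sketch, the same check can double as a guard in driver code when branch-2.4 is the target (version numbers taken from the CI setup above):

{code:python}
import pandas
import pyarrow

print("pandas: %s" % pandas.__version__)
print("pyarrow: %s" % pyarrow.__version__)

# Fail fast if the environment drifts from the versions branch-2.4 is tested with.
assert pandas.__version__.startswith("0.19."), "branch-2.4 CI uses pandas 0.19.2"
assert pyarrow.__version__.startswith("0.8."), "branch-2.4 CI uses pyarrow 0.8.0"
{code}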



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30834) Add note for recommended versions of Pandas and PyArrow for 2.4.x

2020-02-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-30834:
-
Description: 
CI testing for branch 2.4 has been with the versions below. These are 
recommended and newer versions cannot be guaranteed to work correctly.

 

for 2.4, python 3.6.8:

-bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
 pandas: 0.19.2
 pyarrow: 0.8.0

for master/3.0, python 3.6.8:

-bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
 pandas: 0.24.2
 pyarrow: 0.15.1

  was:
CI testing for branch 2.4 has been with the versions below. These are 
recommened and any newer versions can not be guaranteed correct.

 

for 2.4, python 3.6.8:

 

{{-bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
pandas: 0.19.2
pyarrow: 0.8.0}}

for master/3.0, python 3.6.8:

 

{{-bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
pandas: 0.24.2
pyarrow: 0.15.1}}


> Add note for recommended versions of Pandas and PyArrow for 2.4.x
> -
>
> Key: SPARK-30834
> URL: https://issues.apache.org/jira/browse/SPARK-30834
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.5
>Reporter: Bryan Cutler
>Priority: Major
>
> CI testing for branch 2.4 has been with the versions below. These are 
> recommended and newer versions cannot be guaranteed to work correctly.
>  
> for 2.4, python 3.6.8:
> -bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
> pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
>  pandas: 0.19.2
>  pyarrow: 0.8.0
> for master/3.0, python 3.6.8:
> -bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
> pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
>  pandas: 0.24.2
>  pyarrow: 0.15.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30834) Add note for recommended versions of Pandas and PyArrow for 2.4.x

2020-02-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-30834:
-
Description: 
CI testing for branch 2.4 has been with the versions below. These are 
recommended and newer versions cannot be guaranteed to work correctly.

 

for 2.4, python 3.6.8:

 

{{-bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
pandas: 0.19.2
pyarrow: 0.8.0}}

for master/3.0, python 3.6.8:

 

{{-bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
pandas: 0.24.2
pyarrow: 0.15.1}}

  was:CI testing for branch 2.4 has been with the versions below. These are 
recommened and any newer versions can not be guaranteed correct.


> Add note for recommended versions of Pandas and PyArrow for 2.4.x
> -
>
> Key: SPARK-30834
> URL: https://issues.apache.org/jira/browse/SPARK-30834
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.5
>Reporter: Bryan Cutler
>Priority: Major
>
> CI testing for branch 2.4 has been with the versions below. These are 
> recommended and newer versions cannot be guaranteed to work correctly.
>  
> for 2.4, python 3.6.8:
>  
> {{-bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
> pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
> pandas: 0.19.2
> pyarrow: 0.8.0}}
> for master/3.0, python 3.6.8:
>  
> {{-bash-4.1$ python -c "import pandas; import pyarrow; print('pandas: %s' % 
> pandas.__version__); print('pyarrow: %s' % pyarrow.__version__)"
> pandas: 0.24.2
> pyarrow: 0.15.1}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30834) Add note for recommended versions of Pandas and PyArrow for 2.4.x

2020-02-14 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-30834:


 Summary: Add note for recommended versions of Pandas and PyArrow 
for 2.4.x
 Key: SPARK-30834
 URL: https://issues.apache.org/jira/browse/SPARK-30834
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.5
Reporter: Bryan Cutler


CI testing for branch 2.4 has been with the versions below. These are 
recommended and newer versions cannot be guaranteed to work correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30777) PySpark test_arrow tests fail with Pandas >= 1.0.0

2020-02-10 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033873#comment-17033873
 ] 

Bryan Cutler commented on SPARK-30777:
--

[~dongjoon] I don't think it's a blocker; only the tests fail, and the minimum 
pandas version that we test with is 0.23.

> PySpark test_arrow tests fail with Pandas >= 1.0.0
> --
>
> Key: SPARK-30777
> URL: https://issues.apache.org/jira/browse/SPARK-30777
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Some tests fail using Pandas 1.0.0 and above due to removal of  attr "ix" 
> from DataFrame
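For illustration only (this is the usual pandas-level remedy, not necessarily what the Spark test patch does): pandas 1.0 removed DataFrame.ix, and label-based .loc or position-based .iloc takes its place:

{code:python}
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Pre-1.0 code such as pdf.ix[0:1, "a"] no longer works on pandas >= 1.0.
by_label = pdf.loc[0:1, "a"]    # label-based selection
by_position = pdf.iloc[0:2, 0]  # position-based selection
{code}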



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30777) PySpark test_arrow tests fail with Pandas >= 1.0.0

2020-02-10 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033871#comment-17033871
 ] 

Bryan Cutler commented on SPARK-30777:
--

I'm working on the patch

> PySpark test_arrow tests fail with Pandas >= 1.0.0
> --
>
> Key: SPARK-30777
> URL: https://issues.apache.org/jira/browse/SPARK-30777
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Some tests fail using Pandas 1.0.0 and above due to removal of  attr "ix" 
> from DataFrame



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30777) PySpark test_arrow tests fail with Pandas > 1.0.0

2020-02-10 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-30777:


 Summary: PySpark test_arrow tests fail with Pandas > 1.0.0
 Key: SPARK-30777
 URL: https://issues.apache.org/jira/browse/SPARK-30777
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Affects Versions: 3.0.0
Reporter: Bryan Cutler


Some tests fail using Pandas 1.0.0 and above due to removal of  attr "ix" from 
DataFrame



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30640) Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps

2020-01-26 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-30640.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27358
[https://github.com/apache/spark/pull/27358]

> Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps
> --
>
> Key: SPARK-30640
> URL: https://issues.apache.org/jira/browse/SPARK-30640
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> During conversion of Arrow to Pandas, timestamp columns are modified to 
> localize for the current timezone. If there are no timestamp columns, this 
> can sometimes result in unnecessary copies of the data. See 
> [https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html] for 
> discussion.
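A rough sketch of the idea (illustrative only, not the actual Spark patch): skip the localization work entirely, and therefore any copy, when no column is a timestamp. The helper name and the timezone handling below are assumptions:

{code:python}
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

def localize_timestamps(pdf: pd.DataFrame, timezone: str) -> pd.DataFrame:
    # Hypothetical helper: only touch the frame when a timestamp column exists.
    ts_cols = [c for c in pdf.columns if is_datetime64_any_dtype(pdf[c])]
    if not ts_cols:
        return pdf  # no timestamps: hand back the frame untouched, no copy
    pdf = pdf.copy()
    for c in ts_cols:
        # Assumes tz-naive values stored as UTC, converted to the session timezone.
        pdf[c] = (pdf[c].dt.tz_localize("UTC")
                        .dt.tz_convert(timezone)
                        .dt.tz_localize(None))
    return pdf
{code}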



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30640) Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps

2020-01-26 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-30640:


Assignee: Bryan Cutler

> Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps
> --
>
> Key: SPARK-30640
> URL: https://issues.apache.org/jira/browse/SPARK-30640
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> During conversion of Arrow to Pandas, timestamp columns are modified to 
> localize for the current timezone. If there are no timestamp columns, this 
> can sometimes result in unnecessary copies of the data. See 
> [https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html] for 
> discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30640) Prevent unnecessary copies of data in Arrow to Pandas conversion with Timestamps

2020-01-24 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-30640:


 Summary: Prevent unnecessary copies of data in Arrow to Pandas 
conversion with Timestamps
 Key: SPARK-30640
 URL: https://issues.apache.org/jira/browse/SPARK-30640
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.4
Reporter: Bryan Cutler


During conversion of Arrow to Pandas, timestamp columns are modified to 
localize for the current timezone. If there are no timestamp columns, this can 
sometimes result in unnecessary copies of the data. See 
[https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html] for 
discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception

2020-01-20 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019770#comment-17019770
 ] 

Bryan Cutler commented on SPARK-24915:
--

[~jhereth] since there is already a lot of discussion on that PR I would leave 
it open until there is a conclusion on patching 2.4 or not. If so, then you 
could rebase or open a new PR against branch-2.4.

> Calling SparkSession.createDataFrame with schema can throw exception
> 
>
> Key: SPARK-24915
> URL: https://issues.apache.org/jira/browse/SPARK-24915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Python 3.6.3
> PySpark 2.3.1 (installed via pip)
> OSX 10.12.6
>Reporter: Stephen Spencer
>Priority: Major
>
> There seems to be a bug in PySpark when using the PySparkSQL session to 
> create a dataframe with a pre-defined schema.
> Code to reproduce the error:
> {code:java}
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, StringType, Row
> conf = SparkConf().setMaster("local").setAppName("repro") 
> context = SparkContext(conf=conf) 
> session = SparkSession(context)
> # Construct schema (the order of fields is important)
> schema = StructType([
> StructField('field2', StructType([StructField('sub_field', StringType(), 
> False)]), False),
> StructField('field1', StringType(), False),
> ])
> # Create data to populate data frame
> data = [
> Row(field1="Hello", field2=Row(sub_field='world'))
> ]
> # Attempt to create the data frame supplying the schema
> # this will throw a ValueError
> df = session.createDataFrame(data, schema=schema)
> df.show(){code}
> Running this throws a ValueError
> {noformat}
> Traceback (most recent call last):
> File "schema_bug.py", line 18, in 
> df = session.createDataFrame(data, schema=schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 691, in createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in _createFromLocal
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in 
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in toInternal
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in 
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 439, in toInternal
> return self.dataType.toInternal(obj)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 619, in toInternal
> raise ValueError("Unexpected tuple %r with StructType" % obj)
> ValueError: Unexpected tuple 'Hello' with StructType{noformat}
> The problem seems to be here:
> https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603
> specifically the bit
> {code:java}
> zip(self.fields, obj, self._needConversion)
> {code}
> This zip statement seems to assume that obj and self.fields are ordered in 
> the same way, so that the elements of obj will correspond to the right fields 
> in the schema. However this is not true, a Row orders its elements 
> alphabetically but the fields in the schema are in whatever order they are 
> specified. In this example field2 is being initialised with the field1 
> element 'Hello'. If you re-order the fields in the schema to go (field1, 
> field2), the given example works without error.
> The schema in the repro is specifically designed to elicit the problem, the 
> fields are out of alphabetical order and one field is a StructType, making 
> schema._needSerializeAnyField==True. However we encountered this in real use.
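A workaround sketch given the behavior described above (Row sorts its fields alphabetically, the schema does not): pass plain tuples in schema field order, so values are matched positionally rather than through Row's sorted field names. A local session is assumed:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[1]").appName("workaround").getOrCreate()

schema = StructType([
    StructField('field2', StructType([StructField('sub_field', StringType(), False)]), False),
    StructField('field1', StringType(), False),
])

# Plain tuples are taken positionally, so list values in schema order: field2 first.
data = [(("world",), "Hello")]
df = spark.createDataFrame(data, schema=schema)
df.show()
{code}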



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception

2020-01-13 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reopened SPARK-24915:
--

> Calling SparkSession.createDataFrame with schema can throw exception
> 
>
> Key: SPARK-24915
> URL: https://issues.apache.org/jira/browse/SPARK-24915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Python 3.6.3
> PySpark 2.3.1 (installed via pip)
> OSX 10.12.6
>Reporter: Stephen Spencer
>Priority: Major
>
> There seems to be a bug in PySpark when using the PySparkSQL session to 
> create a dataframe with a pre-defined schema.
> Code to reproduce the error:
> {code:java}
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, StringType, Row
> conf = SparkConf().setMaster("local").setAppName("repro") 
> context = SparkContext(conf=conf) 
> session = SparkSession(context)
> # Construct schema (the order of fields is important)
> schema = StructType([
> StructField('field2', StructType([StructField('sub_field', StringType(), 
> False)]), False),
> StructField('field1', StringType(), False),
> ])
> # Create data to populate data frame
> data = [
> Row(field1="Hello", field2=Row(sub_field='world'))
> ]
> # Attempt to create the data frame supplying the schema
> # this will throw a ValueError
> df = session.createDataFrame(data, schema=schema)
> df.show(){code}
> Running this throws a ValueError
> {noformat}
> Traceback (most recent call last):
> File "schema_bug.py", line 18, in 
> df = session.createDataFrame(data, schema=schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 691, in createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in _createFromLocal
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in 
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in toInternal
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in 
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 439, in toInternal
> return self.dataType.toInternal(obj)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 619, in toInternal
> raise ValueError("Unexpected tuple %r with StructType" % obj)
> ValueError: Unexpected tuple 'Hello' with StructType{noformat}
> The problem seems to be here:
> https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603
> specifically the bit
> {code:java}
> zip(self.fields, obj, self._needConversion)
> {code}
> This zip statement seems to assume that obj and self.fields are ordered in 
> the same way, so that the elements of obj will correspond to the right fields 
> in the schema. However this is not true, a Row orders its elements 
> alphabetically but the fields in the schema are in whatever order they are 
> specified. In this example field2 is being initialised with the field1 
> element 'Hello'. If you re-order the fields in the schema to go (field1, 
> field2), the given example works without error.
> The schema in the repro is specifically designed to elicit the problem, the 
> fields are out of alphabetical order and one field is a StructType, making 
> schema._needSerializeAnyField==True. However we encountered this in real use.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception

2020-01-13 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014719#comment-17014719
 ] 

Bryan Cutler commented on SPARK-24915:
--

[~jhereth] apologies for closing prematurely, I didn't know there was still 
some ongoing discussion in the PR. I don't think we can backport SPARK-29748, 
so I'll reopen this for now.

> Calling SparkSession.createDataFrame with schema can throw exception
> 
>
> Key: SPARK-24915
> URL: https://issues.apache.org/jira/browse/SPARK-24915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Python 3.6.3
> PySpark 2.3.1 (installed via pip)
> OSX 10.12.6
>Reporter: Stephen Spencer
>Priority: Major
>
> There seems to be a bug in PySpark when using the PySparkSQL session to 
> create a dataframe with a pre-defined schema.
> Code to reproduce the error:
> {code:java}
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, StringType, Row
> conf = SparkConf().setMaster("local").setAppName("repro") 
> context = SparkContext(conf=conf) 
> session = SparkSession(context)
> # Construct schema (the order of fields is important)
> schema = StructType([
> StructField('field2', StructType([StructField('sub_field', StringType(), 
> False)]), False),
> StructField('field1', StringType(), False),
> ])
> # Create data to populate data frame
> data = [
> Row(field1="Hello", field2=Row(sub_field='world'))
> ]
> # Attempt to create the data frame supplying the schema
> # this will throw a ValueError
> df = session.createDataFrame(data, schema=schema)
> df.show(){code}
> Running this throws a ValueError
> {noformat}
> Traceback (most recent call last):
> File "schema_bug.py", line 18, in 
> df = session.createDataFrame(data, schema=schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 691, in createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in _createFromLocal
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in 
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in toInternal
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in 
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 439, in toInternal
> return self.dataType.toInternal(obj)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 619, in toInternal
> raise ValueError("Unexpected tuple %r with StructType" % obj)
> ValueError: Unexpected tuple 'Hello' with StructType{noformat}
> The problem seems to be here:
> https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603
> specifically the bit
> {code:java}
> zip(self.fields, obj, self._needConversion)
> {code}
> This zip statement seems to assume that obj and self.fields are ordered in 
> the same way, so that the elements of obj will correspond to the right fields 
> in the schema. However this is not true, a Row orders its elements 
> alphabetically but the fields in the schema are in whatever order they are 
> specified. In this example field2 is being initialised with the field1 
> element 'Hello'. If you re-order the fields in the schema to go (field1, 
> field2), the given example works without error.
> The schema in the repro is specifically designed to elicit the problem, the 
> fields are out of alphabetical order and one field is a StructType, making 
> schema._needSerializeAnyField==True. However we encountered this in real use.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception

2020-01-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24915.
--
Resolution: Won't Fix

Closing in favor of fix in SPARK-29748

> Calling SparkSession.createDataFrame with schema can throw exception
> 
>
> Key: SPARK-24915
> URL: https://issues.apache.org/jira/browse/SPARK-24915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Python 3.6.3
> PySpark 2.3.1 (installed via pip)
> OSX 10.12.6
>Reporter: Stephen Spencer
>Priority: Major
>
> There seems to be a bug in PySpark when using the PySparkSQL session to 
> create a dataframe with a pre-defined schema.
> Code to reproduce the error:
> {code:java}
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructType, StructField, StringType, Row
> conf = SparkConf().setMaster("local").setAppName("repro") 
> context = SparkContext(conf=conf) 
> session = SparkSession(context)
> # Construct schema (the order of fields is important)
> schema = StructType([
> StructField('field2', StructType([StructField('sub_field', StringType(), 
> False)]), False),
> StructField('field1', StringType(), False),
> ])
> # Create data to populate data frame
> data = [
> Row(field1="Hello", field2=Row(sub_field='world'))
> ]
> # Attempt to create the data frame supplying the schema
> # this will throw a ValueError
> df = session.createDataFrame(data, schema=schema)
> df.show(){code}
> Running this throws a ValueError
> {noformat}
> Traceback (most recent call last):
> File "schema_bug.py", line 18, in 
> df = session.createDataFrame(data, schema=schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 691, in createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in _createFromLocal
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py",
>  line 423, in <listcomp>
> data = [schema.toInternal(row) for row in data]
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in toInternal
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 601, in <genexpr>
> for f, v, c in zip(self.fields, obj, self._needConversion))
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 439, in toInternal
> return self.dataType.toInternal(obj)
> File 
> "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py",
>  line 619, in toInternal
> raise ValueError("Unexpected tuple %r with StructType" % obj)
> ValueError: Unexpected tuple 'Hello' with StructType{noformat}
> The problem seems to be here:
> https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603
> specifically the bit
> {code:java}
> zip(self.fields, obj, self._needConversion)
> {code}
> This zip statement seems to assume that obj and self.fields are ordered in 
> the same way, so that the elements of obj will correspond to the right fields 
> in the schema. However, this is not true: a Row orders its elements 
> alphabetically, but the fields in the schema are in whatever order they are 
> specified. In this example, field2 is being initialised with the field1 
> element 'Hello'. If you re-order the fields in the schema to go (field1, 
> field2), the given example works without error.
> The schema in the repro is specifically designed to elicit the problem: the 
> fields are out of alphabetical order and one field is a StructType, making 
> schema._needSerializeAnyField==True. However, we encountered this in real use.
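For illustration, a minimal standalone sketch (not part of the original report, assuming the pre-3.0 sorting behavior discussed here) of the mismatch: a Row built from keyword arguments comes out in alphabetical order, while the schema keeps its declared order, so StructType.toInternal zips the two together incorrectly unless the schema is declared as (field1, field2).

{code:python}
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

# Under the legacy behavior, a kwargs-built Row sorts its fields alphabetically.
row = Row(field2=Row(sub_field='world'), field1='Hello')
print(row)  # Row(field1='Hello', field2=Row(sub_field='world'))

# The schema keeps the declared order: field2 first, then field1.
schema = StructType([
    StructField('field2', StructType([StructField('sub_field', StringType(), False)]), False),
    StructField('field1', StringType(), False),
])
print([f.name for f in schema.fields])  # ['field2', 'field1']

# toInternal zips schema.fields with the Row values positionally, so the string
# 'Hello' lands on field2 (a StructType) and raises the ValueError shown above.
# Declaring the schema as (field1, field2) makes the two orders agree.
{code}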



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22232) Row objects in pyspark created using the `Row(**kwargs)` syntax do not get serialized/deserialized properly

2020-01-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-22232.
--
Resolution: Won't Fix

Closing in favor of the fix in SPARK-29748

> Row objects in pyspark created using the `Row(**kwargs)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> The fields in a Row object created from a dict (i.e. {{Row(**kwargs)}}) should 
> be accessed by field name, not by position, because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   # Putting fields in alphabetical order masks the issue
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}
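As a hedged sketch (assuming the same {{sc}} SparkContext as in the snippet above), one workaround is to use the Row class as a factory so the field order is fixed positionally and matches the schema, with or without a shuffle:

{code:python}
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType

# Fix the field order positionally: (a, c, b), matching the schema below.
MyRow = Row("a", "c", "b")

def toRow(i):
    return MyRow("a", 3.0, 2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(toRow)

# The positional order survives the shuffle, so both calls behave the same.
print(rdd.toDF(schema).take(2))
print(rdd.repartition(3).toDF(schema).take(2))
{code}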



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation

2020-01-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-29748:


Assignee: Bryan Cutler

> Remove sorting of fields in PySpark SQL Row creation
> 
>
> Key: SPARK-29748
> URL: https://issues.apache.org/jira/browse/SPARK-29748
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Currently, when a PySpark Row is created with keyword arguments, the fields 
> are sorted alphabetically. This has created a lot of confusion with users 
> because it is not obvious (although it is stated in the pydocs) that they 
> will be sorted alphabetically, and then an error can occur later when 
> applying a schema and the field order does not match.
> The original reason for sorting fields is because kwargs in python < 3.6 are 
> not guaranteed to be in the same order that they were entered. Sorting 
> alphabetically would ensure a consistent order.  Matters are further 
> complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to 
> be referenced by name when made by kwargs, but this flag is not serialized 
> with the Row and leads to inconsistent behavior.
> This JIRA proposes that any sorting of the fields is removed. Users with 
> Python 3.6+ creating Rows with kwargs can continue to do so since Python will 
> ensure the order is the same as entered. Users with Python < 3.6 will have to 
> create Rows with an OrderedDict or by using the Row class as a factory 
> (explained in the pydoc).  If kwargs are used, an error will be raised or 
> based on a conf setting it can fall back to a LegacyRow that will sort the 
> fields as before. This LegacyRow will be immediately deprecated and removed 
> once support for Python < 3.6 is dropped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation

2020-01-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-29748.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26496
[https://github.com/apache/spark/pull/26496]

> Remove sorting of fields in PySpark SQL Row creation
> 
>
> Key: SPARK-29748
> URL: https://issues.apache.org/jira/browse/SPARK-29748
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, when a PySpark Row is created with keyword arguments, the fields 
> are sorted alphabetically. This has created a lot of confusion with users 
> because it is not obvious (although it is stated in the pydocs) that they 
> will be sorted alphabetically, and then an error can occur later when 
> applying a schema and the field order does not match.
> The original reason for sorting fields is because kwargs in python < 3.6 are 
> not guaranteed to be in the same order that they were entered. Sorting 
> alphabetically would ensure a consistent order.  Matters are further 
> complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to 
> be referenced by name when made by kwargs, but this flag is not serialized 
> with the Row and leads to inconsistent behavior.
> This JIRA proposes that any sorting of the fields is removed. Users with 
> Python 3.6+ creating Rows with kwargs can continue to do so since Python will 
> ensure the order is the same as entered. Users with Python < 3.6 will have to 
> create Rows with an OrderedDict or by using the Row class as a factory 
> (explained in the pydoc).  If kwargs are used, an error will be raised or 
> based on a conf setting it can fall back to a LegacyRow that will sort the 
> fields as before. This LegacyRow will be immediately deprecated and removed 
> once support for Python < 3.6 is dropped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf

2019-12-18 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999545#comment-16999545
 ] 

Bryan Cutler commented on SPARK-28502:
--

The problem is that returning nested StructTypes is not currently supported, so 
> you will need to flatten the fields of the window into separate columns. Since 
adding support for that is a new feature and not a bug, I'll close this.
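For reference, a hedged sketch of that flattening, applied to the {{df}} from the repro quoted below (the column names {{win_start}}/{{win_end}} are illustrative, not from the original report): pull {{window.start}} and {{window.end}} out as top-level timestamp columns so neither the input nor the output of the grouped-map UDF contains a nested struct.

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, DoubleType, TimestampType

# Output schema with the window flattened into two plain timestamp columns.
schema = StructType([
    StructField("id", DoubleType()),
    StructField("win_start", TimestampType()),
    StructField("win_end", TimestampType()),
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def some_udf(pdf):
    # some computation; return only the columns declared in the output schema
    return pdf[["id", "win_start", "win_end"]]

# Flatten the window struct before grouping, then group on the plain columns.
flattened = (df
             .withColumn("win", F.window("ts", "15 days"))
             .withColumn("win_start", F.col("win.start"))
             .withColumn("win_end", F.col("win.end"))
             .drop("win"))

flattened.groupby("id", "win_start", "win_end").apply(some_udf).show()
{code}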

> Error with struct conversion while using pandas_udf
> ---
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
>Reporter: Nasir Ali
>Priority: Minor
> Fix For: 3.0.0
>
>
> What I am trying to do: Group data based on time intervals (e.g., 15 days 
> window) and perform some operations on dataframe using (pandas) UDFs. I don't 
> know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and error message I am getting.
>  
> {code:java}
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
> (13.00, "2018-03-11T12:27:18+00:00"),
> (25.00, "2018-03-12T11:27:18+00:00"),
> (20.00, "2018-03-13T15:27:18+00:00"),
> (17.00, "2018-03-14T12:27:18+00:00"),
> (99.00, "2018-03-15T11:27:18+00:00"),
> (156.00, "2018-03-22T11:27:18+00:00"),
> (17.00, "2018-03-31T11:27:18+00:00"),
> (25.00, "2018-03-15T11:27:18+00:00"),
> (25.00, "2018-03-16T11:27:18+00:00")
> ],
>["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
> StructField("id", IntegerType()),
> StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
> # some computation
> return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use builtin agg method then it works all fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28502) Error with struct conversion while using pandas_udf

2019-12-18 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-28502.
--
Resolution: Fixed

> Error with struct conversion while using pandas_udf
> ---
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
>Reporter: Nasir Ali
>Priority: Minor
> Fix For: 3.0.0
>
>
> What I am trying to do: Group data based on time intervals (e.g., 15 days 
> window) and perform some operations on dataframe using (pandas) UDFs. I don't 
> know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and error message I am getting.
>  
> {code:java}
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
> (13.00, "2018-03-11T12:27:18+00:00"),
> (25.00, "2018-03-12T11:27:18+00:00"),
> (20.00, "2018-03-13T15:27:18+00:00"),
> (17.00, "2018-03-14T12:27:18+00:00"),
> (99.00, "2018-03-15T11:27:18+00:00"),
> (156.00, "2018-03-22T11:27:18+00:00"),
> (17.00, "2018-03-31T11:27:18+00:00"),
> (25.00, "2018-03-15T11:27:18+00:00"),
> (25.00, "2018-03-16T11:27:18+00:00")
> ],
>["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
> StructField("id", IntegerType()),
> StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
> # some computation
> return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use builtin agg method then it works all fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation

2019-12-03 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987252#comment-16987252
 ] 

Bryan Cutler commented on SPARK-29748:
--

[~zero323] I made some updates to the PR to remove the _LegacyRow and the option 
for OrderedDict, and also, as you suggested, for Python 3.6 it will automatically 
fall back to the legacy behavior of sorting and print a warning to the user.

> Remove sorting of fields in PySpark SQL Row creation
> 
>
> Key: SPARK-29748
> URL: https://issues.apache.org/jira/browse/SPARK-29748
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, when a PySpark Row is created with keyword arguments, the fields 
> are sorted alphabetically. This has created a lot of confusion with users 
> because it is not obvious (although it is stated in the pydocs) that they 
> will be sorted alphabetically, and then an error can occur later when 
> applying a schema and the field order does not match.
> The original reason for sorting fields is because kwargs in python < 3.6 are 
> not guaranteed to be in the same order that they were entered. Sorting 
> alphabetically would ensure a consistent order.  Matters are further 
> complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to 
> be referenced by name when made by kwargs, but this flag is not serialized 
> with the Row and leads to inconsistent behavior.
> This JIRA proposes that any sorting of the fields is removed. Users with 
> Python 3.6+ creating Rows with kwargs can continue to do so since Python will 
> ensure the order is the same as entered. Users with Python < 3.6 will have to 
> create Rows with an OrderedDict or by using the Row class as a factory 
> (explained in the pydoc).  If kwargs are used, an error will be raised or 
> based on a conf setting it can fall back to a LegacyRow that will sort the 
> fields as before. This LegacyRow will be immediately deprecated and removed 
> once support for Python < 3.6 is dropped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs

2019-12-03 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987154#comment-16987154
 ] 

Bryan Cutler commented on SPARK-30063:
--

I haven't looked at your bug report in detail but you are right that there was 
a change in the Arrow message format for 0.15.0+. To maintain compatibility 
with current versions of Spark, an environment variable can be set by adding it 
to conf/spark-env.sh:

{{ARROW_PRE_0_15_IPC_FORMAT=1}}

It's described a little more in the docs here 
[https://github.com/apache/spark/blob/master/docs/sql-pyspark-pandas-with-arrow.md#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x]

Could you try this out and see if it solves your issue?
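If editing conf/spark-env.sh is awkward in your deployment, a hedged alternative sketch is to pass the variable to the executors through {{spark.executorEnv}} when building the session (exact propagation depends on the cluster manager, and the driver process may also need the variable in its own environment for driver-side Arrow conversions):

{code:python}
from pyspark.sql import SparkSession

# Ask executors to launch their Python workers with the legacy Arrow IPC format,
# so pyarrow >= 0.15 can interoperate with Spark 2.4.x.
spark = (SparkSession.builder
         .appName("arrow-compat")
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())
{code}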

> Failure when returning a value from multiple Pandas UDFs
> 
>
> Key: SPARK-30063
> URL: https://issues.apache.org/jira/browse/SPARK-30063
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3, 2.4.4
> Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on 
> both 2.4.3 and 2.4.4
>Reporter: Tim Kellogg
>Priority: Major
> Attachments: spark-debug.txt, variety-of-schemas.ipynb
>
>
> I have 20 Pandas UDFs that I'm trying to evaluate all at the same time.
>  * PandasUDFType.GROUPED_AGG
>  * 3 columns in the input data frame being serialized over Arrow to Python 
> worker. See below for clarification.
>  * All functions take 2 parameters, some combination of the 3 received as 
> Arrow input.
>  * Varying return types, see details below.
> _*I get an IllegalArgumentException on the Scala side of the worker when 
> deserializing from Python.*_
> h2. Exception & Stack Trace
> {code:java}
> 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, 
> localhost, executor driver): java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStrea

[jira] [Assigned] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-19 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-29691:


Assignee: John Bauer

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Assignee: John Bauer
>Priority: Minor
> Fix For: 3.0.0
>
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-19 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-29691.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26527
[https://github.com/apache/spark/pull/26527]

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
> Fix For: 3.0.0
>
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation

2019-11-19 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977749#comment-16977749
 ] 

Bryan Cutler commented on SPARK-29748:
--

[~zero323] and [~jhereth] this is targeted for Spark 3.0 and I agree, the 
behavior of Row should be very well defined to avoid any further confusion.

bq. Introducing {{LegacyRow}} seems to make little sense if implementation of 
{{Row}} stays the same otherwise. Sorting or not, depending on the config, 
should be enough.

LegacyRow isn't meant to be public and the user will not be aware of it. The 
reasons for it are to separate different implementations and make for a clean 
removal in the future without affecting the standard Row class. Having a 
separate implementation will make it easier to debug and diagnose problems - I 
don't want to get into the situation where a Row could sort fields or not, and 
then get bug reports without knowing which way it was configured.

bq. I don't think we should introduce such behavior now, when 3.5 is 
deprecated. Having yet another way to initialize Row will be confusing at best 

That's reasonable. I'm not crazy about an option for OrderedDict as input, but 
I think users of Python < 3.6 should have a way to create a Row with ordered 
fields other than the 2-step process in the pydoc. We can explore other options 
for this.

bq. Make legacy behavior the only option for Python < 3.6.

I don't think we should have 2 very different behaviors that are chosen based 
on your Python version. The user should be aware of what is happening and needs 
to make the decision to use the legacy sorting. Some users will not know this, 
then upgrade their Python version and see Rows breaking. We should allow users 
with Python < 3.6 to make Rows with ordered fields and then be able to upgrade 
Python version without breaking their Spark app.

bq. For Python 3.6 let's introduce a legacy sorting mechanism (keeping only a 
single Row class), enabled by default and deprecated.

Yeah, I'm not sure if we should enable the legacy sorting as default or not, 
what do others think?
 

> Remove sorting of fields in PySpark SQL Row creation
> 
>
> Key: SPARK-29748
> URL: https://issues.apache.org/jira/browse/SPARK-29748
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, when a PySpark Row is created with keyword arguments, the fields 
> are sorted alphabetically. This has created a lot of confusion with users 
> because it is not obvious (although it is stated in the pydocs) that they 
> will be sorted alphabetically, and then an error can occur later when 
> applying a schema and the field order does not match.
> The original reason for sorting fields is because kwargs in python < 3.6 are 
> not guaranteed to be in the same order that they were entered. Sorting 
> alphabetically would ensure a consistent order.  Matters are further 
> complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to 
> be referenced by name when made by kwargs, but this flag is not serialized 
> with the Row and leads to inconsistent behavior.
> This JIRA proposes that any sorting of the fields is removed. Users with 
> Python 3.6+ creating Rows with kwargs can continue to do so since Python will 
> ensure the order is the same as entered. Users with Python < 3.6 will have to 
> create Rows with an OrderedDict or by using the Row class as a factory 
> (explained in the pydoc).  If kwargs are used, an error will be raised or 
> based on a conf setting it can fall back to a LegacyRow that will sort the 
> fields as before. This LegacyRow will be immediately deprecated and removed 
> once support for Python < 3.6 is dropped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation

2019-11-14 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974678#comment-16974678
 ] 

Bryan Cutler commented on SPARK-29748:
--

Thanks for discussing [~zero323] . The goal here is to only remove the sorting 
of fields, which causes all kinds of weird inconsistencies like in your above 
example. I'd prefer to leave efficient field access for another time. Since Row 
is a subclass of tuple, accessing fields by name has never been efficient and I 
don't want to change the fundamental design here. The only reason to introduce 
LegacyRow (which will be deprecated) is to maintain backward compatibility with 
existing code that expects fields to be sorted.

> Remove sorting of fields in PySpark SQL Row creation
> 
>
> Key: SPARK-29748
> URL: https://issues.apache.org/jira/browse/SPARK-29748
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, when a PySpark Row is created with keyword arguments, the fields 
> are sorted alphabetically. This has created a lot of confusion with users 
> because it is not obvious (although it is stated in the pydocs) that they 
> will be sorted alphabetically, and then an error can occur later when 
> applying a schema and the field order does not match.
> The original reason for sorting fields is because kwargs in python < 3.6 are 
> not guaranteed to be in the same order that they were entered. Sorting 
> alphabetically would ensure a consistent order.  Matters are further 
> complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to 
> be referenced by name when made by kwargs, but this flag is not serialized 
> with the Row and leads to inconsistent behavior.
> This JIRA proposes that any sorting of the fields is removed. Users with 
> Python 3.6+ creating Rows with kwargs can continue to do so since Python will 
> ensure the order is the same as entered. Users with Python < 3.6 will have to 
> create Rows with an OrderedDict or by using the Row class as a factory 
> (explained in the pydoc).  If kwargs are used, an error will be raised or 
> based on a conf setting it can fall back to a LegacyRow that will sort the 
> fields as before. This LegacyRow will be immediately deprecated and removed 
> once support for Python < 3.6 is dropped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29493) Add MapType support for Arrow Java

2019-11-12 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972833#comment-16972833
 ] 

Bryan Cutler commented on SPARK-29493:
--

[~jalpan.randeri] this depends on SPARK-29376 for a newer version of Arrow. I 
was planning on doing this after because it might be a little tricky. If you 
want to take a look at SPARK-25351, I think that would be more straightforward 
to add.

> Add MapType support for Arrow Java
> --
>
> Key: SPARK-29493
> URL: https://issues.apache.org/jira/browse/SPARK-29493
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> This will add MapType support for Arrow in Spark ArrowConverters. This can 
> happen after the Arrow 0.15.0 upgrade, but MapType is not available in 
> pyarrow yet, so pyspark and pandas_udf will come later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2019-11-12 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reopened SPARK-25351:
--

Reopening, this should be straightforward to add.

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: bulk-closed
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " 
> + str(at))
>1670 return spark_type
>1671 
> TypeError: Unsupported type in conversion from Arrow: 
> dictionary
> {noformat}
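Until category support is added, a hedged workaround sketch is to convert the categorical column back to plain values (or to its integer codes) before handing the pandas DataFrame to {{createDataFrame}}:

{code:python}
import pandas as pd

pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
pdf["B"] = pdf["A"].astype("category")

# Convert the categories back to plain strings so Arrow sees a string column...
pdf["B"] = pdf["B"].astype(str)
# ...or, alternatively, keep the integer category codes instead:
# pdf["B"] = pdf["B"].cat.codes

df = spark.createDataFrame(pdf)  # works with spark.sql.execution.arrow.enabled=True
df.printSchema()
{code}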



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29798) Infers bytes as binary type in Python 3 at PySpark

2019-11-08 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-29798.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26432
[https://github.com/apache/spark/pull/26432]

> Infers bytes as binary type in Python 3 at PySpark
> --
>
> Key: SPARK-29798
> URL: https://issues.apache.org/jira/browse/SPARK-29798
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, PySpark cannot infer {{bytes}} type in Python 3. This should be 
> accepted as binary type. See https://github.com/apache/spark/pull/25749
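As a hedged sketch of the intended behavior (a hypothetical example, assuming a Spark build that includes this fix), a Python 3 {{bytes}} value should be inferred as binary when no schema is supplied:

{code:python}
# Infer the type of a bytes value in Python 3; before the fix this raised
# a "not supported type" error during schema inference.
df = spark.createDataFrame([(b"abc",)], ["payload"])
df.printSchema()
# root
#  |-- payload: binary (nullable = true)
{code}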



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29798) Infers bytes as binary type in Python 3 at PySpark

2019-11-08 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-29798:


Assignee: Hyukjin Kwon

> Infers bytes as binary type in Python 3 at PySpark
> --
>
> Key: SPARK-29798
> URL: https://issues.apache.org/jira/browse/SPARK-29798
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, PySpark cannot infer {{bytes}} type in Python 3. This should be 
> accepted as binary type. See https://github.com/apache/spark/pull/25749



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29376) Upgrade Apache Arrow to 0.15.1

2019-11-08 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-29376:
-
Description: 
Apache Arrow 0.15.0 was just released see 
[https://arrow.apache.org/blog/2019/10/06/0.15.0-release/]

There are a number of fixes and improvements including a change to the binary 
IPC format https://issues.apache.org/jira/browse/ARROW-6313.

The next planned release will be 1.0.0, so it would be good to upgrade Spark as 
a preliminary step.

Updated to use Apache Arrow 0.15.1, which was released soon after 0.15.0 to 
include some important bug fixes.

change log at https://arrow.apache.org/release/0.15.1.html

  was:
Apache Arrow 0.15.0 was just released see 
https://arrow.apache.org/blog/2019/10/06/0.15.0-release/

There are a number of fixes and improvements including a change to the binary 
IPC format https://issues.apache.org/jira/browse/ARROW-6313.

The next planned release will be 1.0.0, so it would be good to upgrade Spark as 
a preliminary step.


> Upgrade Apache Arrow to 0.15.1
> --
>
> Key: SPARK-29376
> URL: https://issues.apache.org/jira/browse/SPARK-29376
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Apache Arrow 0.15.0 was just released see 
> [https://arrow.apache.org/blog/2019/10/06/0.15.0-release/]
> There are a number of fixes and improvements including a change to the binary 
> IPC format https://issues.apache.org/jira/browse/ARROW-6313.
> The next planned release will be 1.0.0, so it would be good to upgrade Spark 
> as a preliminary step.
> Updated to use Apache Arrow 0.15.1, which was released soon after 0.15.0 to 
> include some important bug fixes.
> change log at https://arrow.apache.org/release/0.15.1.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


