[jira] [Resolved] (SPARK-29655) Enable adaptive execution should not add more ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-29655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29655. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26409 [https://github.com/apache/spark/pull/26409] > Enable adaptive execution should not add more ShuffleExchange > - > > Key: SPARK-29655 > URL: https://issues.apache.org/jira/browse/SPARK-29655 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Enable adaptive execution should not add more ShuffleExchange. How to > reproduce: > {code:scala} > import org.apache.spark.sql.SaveMode > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > spark.conf.set("spark.sql.shuffle.partitions", 4) > val bucketedTableName = "bucketed_table" > spark.range(10).write.bucketBy(4, > "id").sortBy("id").mode(SaveMode.Overwrite).saveAsTable(bucketedTableName) > val bucketedTable = spark.table(bucketedTableName) > val df = spark.range(4) > df.join(bucketedTable, "id").explain() > spark.conf.set("spark.sql.adaptive.enabled", true) > spark.conf.set("spark.sql.adaptive.shuffle.maxNumPostShufflePartitions", 5) > df.join(bucketedTable, "id").explain() > {code} > Output: > {noformat} > == Physical Plan == > AdaptiveSparkPlan(isFinalPlan=false) > +- Project [id#5L] >+- SortMergeJoin [id#5L], [id#3L], Inner > :- Sort [id#5L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(id#5L, 5), true, [id=#92] > : +- Range (0, 4, step=1, splits=16) > +- Sort [id#3L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#3L, 5), true, [id=#93] > +- Project [id#3L] >+- Filter isnotnull(id#3L) > +- FileScan parquet default.bucketed_table[id#3L] Batched: > true, DataFilters: [isnotnull(id#3L)], Format: Parquet, Location: > InMemoryFileIndex[file:/root/spark-3.0.0-preview-bin-hadoop3.2/spark-warehouse/bucketed_table], > PartitionFilters: [], 
PushedFilters: [IsNotNull(id)], ReadSchema: > struct, SelectedBucketsCount: 4 out of 4 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
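The extra Exchange above can be understood with a toy model (illustrative only, not Spark's actual planner code): a sort-merge join requires both sides hash-partitioned on the join key with the same number of partitions, so when AQE targets a partition count (5) different from the table's bucket count (4), even the bucketed side gets re-shuffled.

```python
def needs_exchange(child_partitioning, required_num_partitions):
    """Toy model of the planning decision: return True if an Exchange must
    be inserted above the child. `child_partitioning` is a
    (key, num_partitions) tuple, or None for an unpartitioned child."""
    if child_partitioning is None:
        return True  # unpartitioned input always needs a shuffle
    _, num = child_partitioning
    # A mismatch between the child's partition count and the planner's
    # target count forces a re-shuffle of an already-partitioned side.
    return num != required_num_partitions

# Without AQE: target 4 (spark.sql.shuffle.partitions) matches the 4 buckets,
# so only the unbucketed side is shuffled.
assert needs_exchange(("id", 4), 4) is False
assert needs_exchange(None, 4) is True

# With AQE targeting 5 post-shuffle partitions, the 4-bucket side now
# mismatches and gains an extra Exchange -- the behavior reported above.
assert needs_exchange(("id", 4), 5) is True
```

The fix in the linked PR avoids this by not letting the adaptive target partition count invalidate an already-satisfying child distribution.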
[jira] [Assigned] (SPARK-29655) Enable adaptive execution should not add more ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-29655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29655: --- Assignee: Yuming Wang > Enable adaptive execution should not add more ShuffleExchange > - > > Key: SPARK-29655 > URL: https://issues.apache.org/jira/browse/SPARK-29655 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Enable adaptive execution should not add more ShuffleExchange. How to > reproduce: > {code:scala} > import org.apache.spark.sql.SaveMode > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > spark.conf.set("spark.sql.shuffle.partitions", 4) > val bucketedTableName = "bucketed_table" > spark.range(10).write.bucketBy(4, > "id").sortBy("id").mode(SaveMode.Overwrite).saveAsTable(bucketedTableName) > val bucketedTable = spark.table(bucketedTableName) > val df = spark.range(4) > df.join(bucketedTable, "id").explain() > spark.conf.set("spark.sql.adaptive.enabled", true) > spark.conf.set("spark.sql.adaptive.shuffle.maxNumPostShufflePartitions", 5) > df.join(bucketedTable, "id").explain() > {code} > Output: > {noformat} > == Physical Plan == > AdaptiveSparkPlan(isFinalPlan=false) > +- Project [id#5L] >+- SortMergeJoin [id#5L], [id#3L], Inner > :- Sort [id#5L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(id#5L, 5), true, [id=#92] > : +- Range (0, 4, step=1, splits=16) > +- Sort [id#3L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#3L, 5), true, [id=#93] > +- Project [id#3L] >+- Filter isnotnull(id#3L) > +- FileScan parquet default.bucketed_table[id#3L] Batched: > true, DataFilters: [isnotnull(id#3L)], Format: Parquet, Location: > InMemoryFileIndex[file:/root/spark-3.0.0-preview-bin-hadoop3.2/spark-warehouse/bucketed_table], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct, SelectedBucketsCount: 4 out of 4 > {noformat} -- This message was 
sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29911) Cache table may memory leak when session closed
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29911: --- Summary: Cache table may memory leak when session closed (was: Cache table may memory leak when session stopped) > Cache table may memory leak when session closed > --- > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. > {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} > !Screen Shot 2019-11-15 at 2.03.49 PM.png! 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29911) Cache table may memory leak when session stopped
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29911: --- Description: How to reproduce: 1. create a local temporary view v1 2. cache it in memory 3. close session without drop v1. The application will hold the memory forever. In a long running thrift server scenario. It's worse. {code} 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; CACHE TABLE testCacheTable AS SELECT 1; +-+--+ | Result | +-+--+ +-+--+ No rows selected (1.498 seconds) 0: jdbc:hive2://localhost:1> !close !close Closing: 0: jdbc:hive2://localhost:1 0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1' !connect 'jdbc:hive2://localhost:1' Connecting to jdbc:hive2://localhost:1 Enter username for jdbc:hive2://localhost:1: lajin Enter password for jdbc:hive2://localhost:1: *** Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 1: jdbc:hive2://localhost:1> select * from testCacheTable; select * from testCacheTable; Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [testCacheTable] (state=,code=0) {code} !Screen Shot 2019-11-15 at 2.03.49 PM.png! was: How to reproduce: 1. create a local temporary view v1 2. cache it in memory 3. close session without drop v1. The application will hold the memory forever. In a long running thrift server scenario. It's worse. 
{code} 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; CACHE TABLE testCacheTable AS SELECT 1; +-+--+ | Result | +-+--+ +-+--+ No rows selected (1.498 seconds) 0: jdbc:hive2://localhost:1> !close !close Closing: 0: jdbc:hive2://localhost:1 0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1' !connect 'jdbc:hive2://localhost:1' Connecting to jdbc:hive2://localhost:1 Enter username for jdbc:hive2://localhost:1: lajin Enter password for jdbc:hive2://localhost:1: *** Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 1: jdbc:hive2://localhost:1> select * from testCacheTable; select * from testCacheTable; Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [testCacheTable] (state=,code=0) {code} > Cache table may memory leak when session stopped > > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. 
> {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} > !Screen Shot 2019-11-15 at 2.03.49 PM.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29911) Cache table may memory leak when session stopped
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29911: --- Attachment: Screen Shot 2019-11-15 at 2.03.49 PM.png > Cache table may memory leak when session stopped > > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. > {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29911) Cache table may memory leak when session stopped
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29911: --- Description: How to reproduce: 1. create a local temporary view v1 2. cache it in memory 3. close session without drop v1. The application will hold the memory forever. In a long running thrift server scenario. It's worse. {code} 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; CACHE TABLE testCacheTable AS SELECT 1; +-+--+ | Result | +-+--+ +-+--+ No rows selected (1.498 seconds) 0: jdbc:hive2://localhost:1> !close !close Closing: 0: jdbc:hive2://localhost:1 0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1' !connect 'jdbc:hive2://localhost:1' Connecting to jdbc:hive2://localhost:1 Enter username for jdbc:hive2://localhost:1: lajin Enter password for jdbc:hive2://localhost:1: *** Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 1: jdbc:hive2://localhost:1> select * from testCacheTable; select * from testCacheTable; Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [testCacheTable] (state=,code=0) {code} was: How to reproduce: 1. create a local temporary view v1 2. cache it in memory 3. close session without drop v1. The application will hold the memory forever. In a long running thrift server scenario. It's worse. 
{code} 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; CACHE TABLE testCacheTable AS SELECT 1; +-+--+ | Result | +-+--+ +-+--+ No rows selected (1.498 seconds) 0: jdbc:hive2://localhost:1> !close !close Closing: 0: jdbc:hive2://localhost:1 0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1' !connect 'jdbc:hive2://localhost:1' Connecting to jdbc:hive2://localhost:1 Enter username for jdbc:hive2://localhost:1: lajin lajin Enter password for jdbc:hive2://localhost:1: 123 *** Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 1: jdbc:hive2://localhost:1> select * from testCacheTable; select * from testCacheTable; Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [testCacheTable] (state=,code=0) {code} > Cache table may memory leak when session stopped > > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. 
> {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29911) Cache table may memory leak when session stopped
Lantao Jin created SPARK-29911: -- Summary: Cache table may memory leak when session stopped Key: SPARK-29911 URL: https://issues.apache.org/jira/browse/SPARK-29911 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png
How to reproduce:
1. create a local temporary view v1
2. cache it in memory
3. close the session without dropping v1.
The application will hold the memory forever; in a long-running Thrift server scenario it's worse.
{code}
0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1;
CACHE TABLE testCacheTable AS SELECT 1;
+-+--+
| Result |
+-+--+
+-+--+
No rows selected (1.498 seconds)
0: jdbc:hive2://localhost:1> !close
!close
Closing: 0: jdbc:hive2://localhost:1
0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1'
!connect 'jdbc:hive2://localhost:1'
Connecting to jdbc:hive2://localhost:1
Enter username for jdbc:hive2://localhost:1: lajin
lajin
Enter password for jdbc:hive2://localhost:1: 123
***
Connected to: Spark SQL (version 3.0.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://localhost:1> select * from testCacheTable;
select * from testCacheTable;
Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [testCacheTable] (state=,code=0)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
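The leak described above comes down to cached data that is not keyed to (or released with) the session that created it. A minimal sketch of the fix direction, with hypothetical names (this is not Spark's actual CacheManager API):

```python
class SessionCacheRegistry:
    """Illustrative registry that tracks cached relations per session so
    they can be released when the session closes."""

    def __init__(self):
        self._cache = {}  # (session_id, table_name) -> cached data

    def cache_table(self, session_id, name, data):
        self._cache[(session_id, name)] = data

    def lookup(self, session_id, name):
        return self._cache.get((session_id, name))

    def close_session(self, session_id):
        # Without this cleanup step, entries outlive the session -- the
        # leak reported above, which is worst in a long-running Thrift
        # server that opens and closes many sessions.
        for key in [k for k in self._cache if k[0] == session_id]:
            del self._cache[key]

reg = SessionCacheRegistry()
reg.cache_table("session-0", "testCacheTable", [1])
reg.close_session("session-0")
assert reg.lookup("session-0", "testCacheTable") is None
```

Note the symptom in the report: after `!close`, the table is no longer *resolvable* (AnalysisException), yet the cached blocks still occupy memory, as the attached heap screenshot shows.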
[jira] [Created] (SPARK-29910) Add minimum runtime limit to speculation
Deegue created SPARK-29910: -- Summary: Add minimum runtime limit to speculation Key: SPARK-29910 URL: https://issues.apache.org/jira/browse/SPARK-29910 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Deegue The minimum runtime for speculation used to be a fixed value of 100 ms, which means tasks that finish within seconds may also be speculated, requiring more executors. To resolve this, we add `spark.speculation.minRuntime` to control the minimum runtime limit for speculation; adjusting `spark.speculation.minRuntime` reduces how many normal tasks get speculated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
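The proposed check can be sketched as follows (a toy model, not Spark's TaskSetManager code; parameter names are illustrative): a task becomes a speculation candidate only if it is both slower than a multiple of the median runtime *and* has already run at least the configured minimum.

```python
def should_speculate(runtime_ms, median_ms, multiplier=1.5, min_runtime_ms=100):
    """Toy sketch of speculative-execution eligibility with a configurable
    minimum-runtime floor (standing in for spark.speculation.minRuntime)."""
    return runtime_ms >= min_runtime_ms and runtime_ms > multiplier * median_ms

# With the old fixed 100 ms floor, a 300 ms task past 1.5x the median
# is speculated even though relaunching it buys almost nothing:
assert should_speculate(300, 150) is True

# Raising the floor (here to 1 s) filters such short tasks out:
assert should_speculate(300, 150, min_runtime_ms=1000) is False
```

The design point is that the floor only suppresses speculation of short tasks; genuinely slow stragglers still exceed both thresholds.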
[jira] [Commented] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974836#comment-16974836 ] koert kuipers commented on SPARK-29906: --- I added a bit of debug logging:
{code:java}
$ git diff
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
index 375cec5971..7e5b7fb235 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
@@ -86,7 +86,7 @@ object CSVDataSource extends Logging {
   }
 }
 
-object TextInputCSVDataSource extends CSVDataSource {
+object TextInputCSVDataSource extends CSVDataSource with Logging {
   override val isSplitable: Boolean = true
 
   override def readFile(
@@ -110,9 +110,13 @@ object TextInputCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): StructType = {
+    logInfo(s"!! inputPaths ${inputPaths}")
     val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions)
     val maybeFirstLine = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).take(1).headOption
-    inferFromDataset(sparkSession, csv, maybeFirstLine, parsedOptions)
+    logInfo(s"!! maybeFirstLine ${maybeFirstLine}")
+    val schema = inferFromDataset(sparkSession, csv, maybeFirstLine, parsedOptions)
+    logInfo(s"!! schema ${schema}")
+    schema
   }
{code}
and this shows when spark.sql.adaptive.enabled=true:
{code:java}
19/11/15 05:52:06 INFO csv.TextInputCSVDataSource: !!
inputPaths List(LocatedFileStatus{path=hdfs://ip-xx-xxx-x-xxx.ec2.internal:8020/user/hadoop/OP_DTL_GNRL_PGYR2013_P06282019.csv; isDirectory=false; length=2242114396; replication=3; blocksize=134217728; modification_time=1573794115499; access_time=1573794109887; owner=hadoop; group=hadoop; permission=rw-r--r--; isSymlink=false}) 19/11/15 05:52:10 INFO csv.TextInputCSVDataSource: !! maybeFirstLine Some("UNCHANGED","Covered Recipient Physician""195068","SCOTT","KEVIN","FORMAN",,"360 SAN MIGUEL DR","SUITE 701","NEWPORT BEACH","CA","92660-7853","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Orthopaedic Surgery","CA","Wright Medical Technology, Inc.","10011065","Wright Medical Technology, Inc.","TN","United States",12.50,"08/20/2013","1","In-kind items and services","Food and Beverage""No","No Third Party Payment",,"No",,,"No","105165962","No","Covered","Foot and Ankle",,,"2013","06/28/2019") 19/11/15 05:52:10 INFO csv.TextInputCSVDataSource: !! schema StructType(StructField(UNCHANGED,StringType,true), StructField(Covered Recipient Physician,StringType,true), StructField(_c2,StringType,true), StructField(_c3,StringType,true), StructField(_c4,StringType,true), StructField(195068,StringType,true), StructField(SCOTT,StringType,true), StructField(KEVIN,StringType,true), StructField(FORMAN,StringType,true), StructField(_c9,StringType,true), StructField(360 SAN MIGUEL DR,StringType,true), StructField(SUITE 701,StringType,true), StructField(NEWPORT BEACH,StringType,true), StructField(CA13,StringType,true), StructField(92660-7853,StringType,true), StructField(United States15,StringType,true), StructField(_c16,StringType,true), StructField(_c17,StringType,true), StructField(Medical Doctor,StringType,true), StructField(Allopathic & Osteopathic Physicians|Orthopaedic Surgery,StringType,true), StructField(CA20,StringType,true), StructField(_c21,StringType,true), StructField(_c22,StringType,true), StructField(_c23,StringType,true), 
StructField(_c24,StringType,true), StructField(Wright Medical Technology, Inc.25,StringType,true), StructField(10011065,StringType,true), StructField(Wright Medical Technology, Inc.27,StringType,true), StructField(TN,StringType,true), StructField(United States29,StringType,true), StructField(12.50,StringType,true), StructField(08/20/2013,StringType,true), StructField(1,StringType,true), StructField(In-kind items and services,StringType,true), StructField(Food and Beverage,StringType,true), StructField(_c35,StringType,true), StructField(_c36,StringType,true), StructField(_c37,StringType,true), StructField(No38,StringType,true), StructField(No Third Party Payment,StringType,true), StructField(_c40,StringType,true), StructField(No41,StringType,true), StructField(_c42,StringType,true), StructField(_c43,StringType,true), StructField(No44,StringType,true), StructField(105165962,StringType,true), StructField(No46,StringType,true), StructField(Covered,StringType,true), StructField(Foot and Ankle,StringType,true), StructField(_c49,StringType,true), StructField(_c50,StringType,true),
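The inferred StructType above is the telltale symptom: with adaptive execution on, `maybeFirstLine` is a *data* row, so its values ("195068", "SCOTT", …) become column names and blank fields become `_cNN` placeholders. A minimal illustration of that failure mode (toy code, not Spark's inference logic):

```python
import csv
import io

def infer_columns(text, header_found=True):
    """Toy schema inference: normally the first line supplies the column
    names; if the header line is lost (as appears to happen here with AQE
    enabled), the first data row's values are mistaken for the schema."""
    rows = list(csv.reader(io.StringIO(text)))
    candidate = rows[0] if header_found else rows[1]
    # Empty fields get positional placeholder names, like Spark's _cNN.
    return [v if v else f"_c{i}" for i, v in enumerate(candidate)]

data = 'change_type,name\n"UNCHANGED","SCOTT"\n'
assert infer_columns(data) == ["change_type", "name"]
# Header lost: data values leak into the "schema", as in the log above.
assert infer_columns(data, header_found=False) == ["UNCHANGED", "SCOTT"]
```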
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974824#comment-16974824 ] Terry Kim commented on SPARK-29900: --- Cool. I will compile the list and send it out to dev/user list. Thanks! > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In Postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 row) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974822#comment-16974822 ] Wenchen Fan commented on SPARK-29900: - Yea exactly! I don't think it's a big breaking change. We only break the cases when there are temp view and table with the same name, and users can use a qualified name to disambiguate. To move this forward, we need to: 1. find all the places that need to change the table resolution behavior (e.g. saveAsTable, DROP TABLE) 2. propose it to dev/user list 3. implement it > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In Postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 row) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
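The two lookup behaviors described in this issue can be modeled in a few lines (a toy resolver, not Spark's Analyzer; behavior 1 is SELECT/INSERT/DESC TABLE, behavior 2 is most other commands):

```python
def resolve_relation(name, temp_views, catalog, temp_first=True):
    """Toy model of Spark's two relation-resolution behaviors.
    temp_first=True: try temp views, then tables/persistent views (behavior 1).
    temp_first=False: only tables/persistent views (behavior 2)."""
    if temp_first and name in temp_views:
        return ("temp_view", temp_views[name])
    if name in catalog:
        return ("table", catalog[name])
    raise LookupError(f"Table or view not found: {name}")

# With a temp view and a table sharing the name 't' (as in the Postgres
# session quoted above), the two behaviors resolve differently:
temp_views = {"t": "select 2 as i"}
catalog = {"t": "s1.t"}
assert resolve_relation("t", temp_views, catalog)[0] == "temp_view"
assert resolve_relation("t", temp_views, catalog, temp_first=False)[0] == "table"
```

The inconsistency the issue proposes to remove is exactly this flag: the same name resolves to different relations depending on which command is running.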
[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974821#comment-16974821 ] Joachim Hereth commented on SPARK-29748: [~bryanc] By simply removing sorting we change the semantics, e.g. `Row(a=1, b=2) != Row(b=2, a=1)` (as opposed to what we currently have). Also, there might be problems if data was written with Spark pre-change and read after the change. Adding workarounds (if possible) will make the code very complex. I think [~zero323] was thinking about changes for the upcoming 3.0? > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in Python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc).
If kwargs are used, an error will be raised or > based on a conf setting it can fall back to a LegacyRow that will sort the > fields as before. This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
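The semantic change under discussion can be demonstrated with a minimal Row stand-in (plain `namedtuple`, not PySpark's actual `Row` class): Python 3.6+ preserves kwargs order, so sorting is no longer needed for determinism, but removing it changes the field order and therefore row equality.

```python
from collections import namedtuple

def make_row(sort_fields, **kwargs):
    """Minimal stand-in for PySpark's Row showing the behavior change.
    sort_fields=True mimics the current (sorted) behavior; False mimics
    the proposal to keep kwargs entry order."""
    names = sorted(kwargs) if sort_fields else list(kwargs)
    return namedtuple("Row", names)(**kwargs)

legacy = make_row(True, b=2, a=1)   # current behavior: sorted -> fields (a, b)
new = make_row(False, b=2, a=1)     # proposed: entry order -> fields (b, a)
assert legacy._fields == ("a", "b") and legacy == (1, 2)
assert new._fields == ("b", "a") and new == (2, 1)
```

This is exactly the compatibility concern raised in the comment above: the same kwargs produce differently ordered (hence unequal) rows before and after the change.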
[jira] [Assigned] (SPARK-29888) New interval string parser parse '.111 seconds' to null
[ https://issues.apache.org/jira/browse/SPARK-29888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29888: --- Assignee: Kent Yao > New interval string parser parse '.111 seconds' to null > > > Key: SPARK-29888 > URL: https://issues.apache.org/jira/browse/SPARK-29888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > The current string-to-interval cast logic does not support e.g. cast('.111 > second' as interval), which fails in the SIGN state and returns null; it > should actually be 00:00:00.111. > {code:java} > These are the results of the master branch. > -- !query 63 > select interval '.111 seconds' > -- !query 63 schema > struct<0.111 seconds:interval> > -- !query 63 output > 0.111 seconds > -- !query 64 > select cast('.111 seconds' as interval) > -- !query 64 schema > struct > -- !query 64 output > NULL > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
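For reference, the intended parse of '.111 seconds' can be sketched in plain Python. This is a simplified, hypothetical grammar covering only the seconds unit, not Spark's actual parser:

```python
import re
from datetime import timedelta

def parse_seconds_interval(s):
    # Hypothetical, simplified grammar: [sign][digits][.digits] 'second(s)'.
    m = re.fullmatch(r"\s*([+-]?)(\d*)(?:\.(\d+))?\s+seconds?\s*", s)
    if m is None:
        return None
    sign, whole, frac = m.groups()
    if not whole and not frac:
        return None                      # no digits at all
    # A leading-dot literal like '.111' means a fraction of a second,
    # which is where the reported parser failed.
    value = float((whole or "0") + "." + (frac or "0"))
    return timedelta(seconds=-value if sign == "-" else value)

print(parse_seconds_interval(".111 seconds"))   # 0:00:00.111000
```

The key case is the empty integer part before the dot: the value should be 0.111 seconds rather than a parse failure that surfaces as NULL.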
[jira] [Resolved] (SPARK-29888) New interval string parser parse '.111 seconds' to null
[ https://issues.apache.org/jira/browse/SPARK-29888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29888. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26514 [https://github.com/apache/spark/pull/26514] > New interval string parser parse '.111 seconds' to null > > > Key: SPARK-29888 > URL: https://issues.apache.org/jira/browse/SPARK-29888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > The current string-to-interval cast logic does not support e.g. cast('.111 > second' as interval), which fails in the SIGN state and returns null; it > should actually be 00:00:00.111. > {code:java} > These are the results of the master branch. > -- !query 63 > select interval '.111 seconds' > -- !query 63 schema > struct<0.111 seconds:interval> > -- !query 63 output > 0.111 seconds > -- !query 64 > select cast('.111 seconds' as interval) > -- !query 64 schema > struct > -- !query 64 output > NULL > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section
[ https://issues.apache.org/jira/browse/SPARK-28859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28859. --- Assignee: (was: yifan) Resolution: Invalid According to the test failures on the PR, I'm closing this issue as `Invalid`. Since `MemoryManager` validates this already, it's enough. > Remove value check of MEMORY_OFFHEAP_SIZE in declaration section > > > Key: SPARK-28859 > URL: https://issues.apache.org/jira/browse/SPARK-28859 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > > Now MEMORY_OFFHEAP_SIZE has a default value of 0, but it should be greater than 0 > when > MEMORY_OFFHEAP_ENABLED is true; should we check this condition in code? > > SPARK-28577 added this check before requesting memory resources from Yarn > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
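The invariant under discussion fits in a few lines. This is an illustrative sketch of the check that `MemoryManager` is said to perform already, not Spark's actual code; the point of the resolution is that it belongs in one place rather than also at config declaration time:

```python
def validate_offheap(enabled: bool, size_bytes: int) -> None:
    # A positive off-heap size is only required when off-heap memory
    # is actually enabled; the default of 0 is legal when it is disabled.
    if enabled and size_bytes <= 0:
        raise ValueError(
            "spark.memory.offHeap.size must be > 0 when "
            "spark.memory.offHeap.enabled is true")

validate_offheap(False, 0)       # fine: default size 0, off-heap disabled
validate_offheap(True, 1 << 30)  # fine: 1 GiB off-heap
```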
[jira] [Commented] (SPARK-29892) Add built-in Array Functions: array_cat
[ https://issues.apache.org/jira/browse/SPARK-29892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974800#comment-16974800 ] Aman Omer commented on SPARK-29892: --- This Jira is a duplicate of https://issues.apache.org/jira/browse/SPARK-29737 . > Add built-in Array Functions: array_cat > --- > > Key: SPARK-29892 > URL: https://issues.apache.org/jira/browse/SPARK-29892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_cat(anyarray, anyarray)}}|{{anyarray}}|concatenate two > arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| > Other DBs: > [https://phoenix.apache.org/language/functions.html#array_cat] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
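A sketch of the requested semantics in Python, using lists as a stand-in for SQL arrays. The NULL handling shown is an assumption modeled on PostgreSQL's behavior (a NULL input yields the other array), not a confirmed Spark design:

```python
def array_cat(a, b):
    # Assumed PostgreSQL-style NULL handling: NULL concatenated with an
    # array returns the other array unchanged.
    if a is None:
        return b
    if b is None:
        return a
    return a + b

print(array_cat([1, 2, 3], [4, 5]))  # [1, 2, 3, 4, 5]
```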
[jira] [Updated] (SPARK-27884) Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27884: - Description: Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark 3.0. dev list: http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCEMENT-Plan-for-dropping-Python-2-support-td27335.html http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Deprecate-Python-lt-3-6-in-Spark-3-0-td28168.html was:Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark 3.0. > Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0 > - > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark > 3.0. > dev list: > http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCEMENT-Plan-for-dropping-Python-2-support-td27335.html > http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Deprecate-Python-lt-3-6-in-Spark-3-0-td28168.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27884) Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27884: - Description: Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark 3.0. (was: Officially deprecate Python 2 support in Spark 3.0.) > Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0 > - > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark > 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27884) Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27884. -- Fix Version/s: 3.0.0 Resolution: Done > Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0 > - > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974791#comment-16974791 ] Hyukjin Kwon commented on SPARK-29803: -- (I converted this into a subtask of a new JIRA) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > There are 135 Python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
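The cleanup itself is mechanical. A hedged sketch of the per-file transformation (the helper is illustrative; the actual PR may work differently):

```python
import re

# Matches the whole import line, including its newline, in multiline mode.
FUTURE_RE = re.compile(r"^from __future__ import print_function\n", re.MULTILINE)

def strip_print_function(source: str) -> str:
    # Under Python 3, print() is always a function, so the import is dead code.
    return FUTURE_RE.sub("", source)

before = "from __future__ import print_function\nprint('hello')\n"
print(strip_print_function(before))  # print('hello')
```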
[jira] [Updated] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29803: - Parent: SPARK-29909 Issue Type: Sub-task (was: Bug) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29897) Implicit cast to timestamp is failing
[ https://issues.apache.org/jira/browse/SPARK-29897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974789#comment-16974789 ] Ankit Raj Boudh commented on SPARK-29897: - I will raise a PR for this > Implicit cast to timestamp is failing > -- > > Key: SPARK-29897 > URL: https://issues.apache.org/jira/browse/SPARK-29897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Spark cannot cast implicitly > jdbc:hive2://10.18.19.208:23040/default> SELECT EXTRACT(DAY FROM NOW() - > '2014-08-02 08:10:56'); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > '(current_timestamp() - CAST('2014-08-02 08:10:56' AS DOUBLE))' due to data > type mismatch: differing types in '(current_timestamp() - CAST('2014-08-02 > 08:10:56' AS DOUBLE))' (timestamp and double).; line 1 pos 24; > PostgreSQL and MySQL can handle the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
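What the failing query expects can be sketched with Python's datetime as a stand-in for Spark's timestamp arithmetic: the string literal should be implicitly cast to a timestamp (not a double) before the subtraction, and EXTRACT(DAY FROM ...) then reads the day component of the resulting interval. The function name is illustrative:

```python
from datetime import datetime

def extract_day_from_diff(now: datetime, literal: str) -> int:
    # Implicitly cast the string literal to a timestamp, not a double.
    ts = datetime.strptime(literal, "%Y-%m-%d %H:%M:%S")
    # EXTRACT(DAY FROM <interval>) reads the interval's day component.
    return (now - ts).days

print(extract_day_from_diff(datetime(2019, 11, 15), "2014-08-02 08:10:56"))
```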
[jira] [Updated] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29803: - Parent: (was: SPARK-27884) Issue Type: Bug (was: Sub-task) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Bug > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29802) update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974788#comment-16974788 ] Hyukjin Kwon commented on SPARK-29802: -- (I converted this into a subtask of a new JIRA) > update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > There are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
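The shebang rewrite for the files listed above is likewise mechanical; a hedged sketch of the per-file edit (helper name illustrative):

```python
def update_shebang(text: str) -> str:
    # Only rewrite a first line that is exactly the generic python shebang;
    # leave shell scripts and already-updated files alone.
    lines = text.splitlines(keepends=True)
    if lines and lines[0].rstrip() == "#!/usr/bin/env python":
        lines[0] = "#!/usr/bin/env python3\n"
    return "".join(lines)

print(update_shebang("#!/usr/bin/env python\nimport sys\n"))
```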
[jira] [Updated] (SPARK-29909) Drop Python 2 and Python 3.4 and 3.5.
[ https://issues.apache.org/jira/browse/SPARK-29909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29909: - Issue Type: Umbrella (was: Bug) > Drop Python 2 and Python 3.4 and 3.5. > - > > Key: SPARK-29909 > URL: https://issues.apache.org/jira/browse/SPARK-29909 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We deprecated Python 2 and Python 3 prior to 3.6 in PySpark at SPARK-27884. We should drop them in Spark 3.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29802) update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29802: - Parent: (was: SPARK-27884) Issue Type: Bug (was: Sub-task) > update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > there are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29802) update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29802: - Parent: SPARK-29909 Issue Type: Sub-task (was: Bug) > update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > there are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29909) Drop Python 2 and Python 3.4 and 3.5.
Hyukjin Kwon created SPARK-29909: Summary: Drop Python 2 and Python 3.4 and 3.5. Key: SPARK-29909 URL: https://issues.apache.org/jira/browse/SPARK-29909 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon We deprecated Python 2 and Python 3 prior to 3.6 in PySpark at SPARK-27884. We should drop them in Spark 3.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28752) Documentation build script to support Python 3
[ https://issues.apache.org/jira/browse/SPARK-28752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28752. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26521 [https://github.com/apache/spark/pull/26521] > Documentation build script to support Python 3 > -- > > Key: SPARK-28752 > URL: https://issues.apache.org/jira/browse/SPARK-28752 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > Seems documentation build: > https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html > doesn't support Python 3. We should support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28752) Documentation build script to support Python 3
[ https://issues.apache.org/jira/browse/SPARK-28752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28752: Assignee: Hyukjin Kwon > Documentation build script to support Python 3 > -- > > Key: SPARK-28752 > URL: https://issues.apache.org/jira/browse/SPARK-28752 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Seems documentation build: > https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html > doesn't support Python 3. We should support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29376) Upgrade Apache Arrow to 0.15.1
[ https://issues.apache.org/jira/browse/SPARK-29376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29376: - Fix Version/s: 3.0.0 > Upgrade Apache Arrow to 0.15.1 > -- > > Key: SPARK-29376 > URL: https://issues.apache.org/jira/browse/SPARK-29376 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > Apache Arrow 0.15.0 was just released see > [https://arrow.apache.org/blog/2019/10/06/0.15.0-release/] > There are a number of fixes and improvements including a change to the binary > IPC format https://issues.apache.org/jira/browse/ARROW-6313. > The next planned release will be 1.0.0, so it would be good to upgrade Spark > as a preliminary step. > Updated to use Apache Arrow 0.15.1, which was released soon after 0.15.0 to > include some important bug fixes. > change log at https://arrow.apache.org/release/0.15.1.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29376) Upgrade Apache Arrow to 0.15.1
[ https://issues.apache.org/jira/browse/SPARK-29376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29376. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/26133 > Upgrade Apache Arrow to 0.15.1 > -- > > Key: SPARK-29376 > URL: https://issues.apache.org/jira/browse/SPARK-29376 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Apache Arrow 0.15.0 was just released see > [https://arrow.apache.org/blog/2019/10/06/0.15.0-release/] > There are a number of fixes and improvements including a change to the binary > IPC format https://issues.apache.org/jira/browse/ARROW-6313. > The next planned release will be 1.0.0, so it would be good to upgrade Spark > as a preliminary step. > Updated to use Apache Arrow 0.15.1, which was released soon after 0.15.0 to > include some important bug fixes. > change log at https://arrow.apache.org/release/0.15.1.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29908) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
Hyukjin Kwon created SPARK-29908: Summary: Add a Python, Pandas and PyArrow versions in clue at SQL query tests Key: SPARK-29908 URL: https://issues.apache.org/jira/browse/SPARK-29908 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Once a Python test case fails in the integrated UDF test cases, it's difficult to find out the version information. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ as an example. It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
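One way to gather such a clue defensively, sketched in Python (the function name is illustrative; pandas and PyArrow may not be installed, so missing packages are reported rather than raised):

```python
import sys

def version_clue() -> str:
    # Collect Python, pandas and pyarrow versions for a test-failure message.
    parts = ["Python %d.%d.%d" % sys.version_info[:3]]
    for pkg in ("pandas", "pyarrow"):
        try:
            mod = __import__(pkg)
            parts.append("%s %s" % (pkg, mod.__version__))
        except ImportError:
            parts.append("%s not installed" % pkg)
    return ", ".join(parts)

print(version_clue())
```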
[jira] [Updated] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-29906: -- Description: we observed an issue where spark seems to confuse a data line (not the first line of the csv file) for the csv header when it creates the schema. {code} $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP $ unzip PGYR13_P062819.ZIP $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf spark.sql.adaptive.enabled=true --num-executors 10 Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 Spark context available as 'sc' (master = yarn, app id = application_1573772077642_0006). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) Type in expressions to have them evaluated. Type :help for more information. scala> spark.read.format("csv").option("header", true).option("enforceSchema", false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. [Stage 2:>(0 + 10) / 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): java.lang.IllegalArgumentException: CSV header does not conform to the schema. 
Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, Recipient_Primary_Business_Street_Address_Line2, Recipient_City, Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, Physician_License_State_code1, Physician_License_State_code2, Physician_License_State_code3, Physician_License_State_code4, Physician_License_State_code5, Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, Total_Amount_of_Payment_USDollars, Date_of_Payment, Number_of_Payments_Included_in_Total_Amount, Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, City_of_Travel, State_of_Travel, Country_of_Travel, Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, Contextual_Information, Delay_in_Publication_Indicator, Record_ID, Dispute_Status_for_Publication, Product_Indicator, Name_of_Associated_Covered_Drug_or_Biological1, Name_of_Associated_Covered_Drug_or_Biological2, Name_of_Associated_Covered_Drug_or_Biological3, Name_of_Associated_Covered_Drug_or_Biological4, Name_of_Associated_Covered_Drug_or_Biological5, NDC_of_Associated_Covered_Drug_or_Biological1, NDC_of_Associated_Covered_Drug_or_Biological2, NDC_of_Associated_Covered_Drug_or_Biological3, NDC_of_Associated_Covered_Drug_or_Biological4, NDC_of_Associated_Covered_Drug_or_Biological5, 
Name_of_Associated_Covered_Device_or_Medical_Supply1, Name_of_Associated_Covered_Device_or_Medical_Supply2, Name_of_Associated_Covered_Device_or_Medical_Supply3, Name_of_Associated_Covered_Device_or_Medical_Supply4, Name_of_Associated_Covered_Device_or_Medical_Supply5, Program_Year, Payment_Publication_Date Schema: UNCHANGED, Covered Recipient Physician, _c2, _c3, _c4, 278352, JOHN, M, RAY, JR, 3625 CAPE CENTER DR, _c11, FAYETTEVILLE, NC13, 28304-4457, United States15, _c16, _c17, Medical Doctor, Allopathic & Osteopathic Physicians|Family Medicine, NC20, _c21, _c22, _c23, _c24, Par Pharmaceutical, Inc.25, 10010989, Par Pharmaceutical, Inc.27, NY, United States29, 17.29, 10/23/2013, 1, In-kind items and services, Food and Beverage, _c35, _c36, _c37, No38, No Third Party Payment, _c40, _c41, _c42, _c43, No44, 104522962, No46, Covered, MEGACE ES MEGESTROL ACETATE, _c49, _c50, _c51, _c52, 4988409496, _c54, _c55, _c56, _c57,
[jira] [Created] (SPARK-29907) Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte.
Xianyin Xin created SPARK-29907: --- Summary: Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte. Key: SPARK-29907 URL: https://issues.apache.org/jira/browse/SPARK-29907 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin SPARK-27444 introduced `dmlStatementNoWith` so that any DML that needs CTE support can leverage it. It would be better if we moved the DELETE/UPDATE/MERGE rules to `dmlStatementNoWith`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-29906: -- Description: we observed an issue where spark seems to confuse a data line (not the first line of the csv file) for the csv header. {code} $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP $ unzip PGYR13_P062819.ZIP $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf spark.sql.adaptive.enabled=true --num-executors 10 Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 Spark context available as 'sc' (master = yarn, app id = application_1573772077642_0006). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) Type in expressions to have them evaluated. Type :help for more information. scala> spark.read.format("csv").option("header", true).option("enforceSchema", false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. [Stage 2:>(0 + 10) / 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): java.lang.IllegalArgumentException: CSV header does not conform to the schema. 
Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, Recipient_Primary_Business_Street_Address_Line2, Recipient_City, Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, Physician_License_State_code1, Physician_License_State_code2, Physician_License_State_code3, Physician_License_State_code4, Physician_License_State_code5, Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, Total_Amount_of_Payment_USDollars, Date_of_Payment, Number_of_Payments_Included_in_Total_Amount, Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, City_of_Travel, State_of_Travel, Country_of_Travel, Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, Contextual_Information, Delay_in_Publication_Indicator, Record_ID, Dispute_Status_for_Publication, Product_Indicator, Name_of_Associated_Covered_Drug_or_Biological1, Name_of_Associated_Covered_Drug_or_Biological2, Name_of_Associated_Covered_Drug_or_Biological3, Name_of_Associated_Covered_Drug_or_Biological4, Name_of_Associated_Covered_Drug_or_Biological5, NDC_of_Associated_Covered_Drug_or_Biological1, NDC_of_Associated_Covered_Drug_or_Biological2, NDC_of_Associated_Covered_Drug_or_Biological3, NDC_of_Associated_Covered_Drug_or_Biological4, NDC_of_Associated_Covered_Drug_or_Biological5, 
Name_of_Associated_Covered_Device_or_Medical_Supply1, Name_of_Associated_Covered_Device_or_Medical_Supply2, Name_of_Associated_Covered_Device_or_Medical_Supply3, Name_of_Associated_Covered_Device_or_Medical_Supply4, Name_of_Associated_Covered_Device_or_Medical_Supply5, Program_Year, Payment_Publication_Date Schema: UNCHANGED, Covered Recipient Physician, _c2, _c3, _c4, 278352, JOHN, M, RAY, JR, 3625 CAPE CENTER DR, _c11, FAYETTEVILLE, NC13, 28304-4457, United States15, _c16, _c17, Medical Doctor, Allopathic & Osteopathic Physicians|Family Medicine, NC20, _c21, _c22, _c23, _c24, Par Pharmaceutical, Inc.25, 10010989, Par Pharmaceutical, Inc.27, NY, United States29, 17.29, 10/23/2013, 1, In-kind items and services, Food and Beverage, _c35, _c36, _c37, No38, No Third Party Payment, _c40, _c41, _c42, _c43, No44, 104522962, No46, Covered, MEGACE ES MEGESTROL ACETATE, _c49, _c50, _c51, _c52, 4988409496, _c54, _c55, _c56, _c57, _c58, _c59, _c60, _c61,
[jira] [Created] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
koert kuipers created SPARK-29906: - Summary: Reading of csv file fails with adaptive execution turned on Key: SPARK-29906 URL: https://issues.apache.org/jira/browse/SPARK-29906 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: build from master today nov 14 commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, upstream/master, upstream/HEAD) Author: Kevin Yu Date: Thu Nov 14 14:58:32 2019 -0600 build using: $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn deployed on AWS EMR 5.28 with 10 m5.xlarge slaves in spark-env.sh: HADOOP_CONF_DIR=/etc/hadoop/conf in spark-defaults.conf: spark.master yarn spark.submit.deployMode client spark.serializer org.apache.spark.serializer.KryoSerializer spark.hadoop.yarn.timeline-service.enabled false spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native Reporter: koert kuipers we observed an issue where spark seems to confuse a data line (not the first line of the csv file) for the csv header. {code} $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP $ unzip PGYR13_P062819.ZIP $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf spark.sql.adaptive.enabled=true --num-executors 10 Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 Spark context available as 'sc' (master = yarn, app id = application_1573772077642_0006). 
Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) Type in expressions to have them evaluated. Type :help for more information. scala> spark.read.format("csv").option("header", true).option("enforceSchema", false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. [Stage 2:>(0 + 10) / 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, Recipient_Primary_Business_Street_Address_Line2, Recipient_City, Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, Physician_License_State_code1, Physician_License_State_code2, Physician_License_State_code3, Physician_License_State_code4, Physician_License_State_code5, Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, Total_Amount_of_Payment_USDollars, Date_of_Payment, Number_of_Payments_Included_in_Total_Amount, Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, City_of_Travel, 
State_of_Travel, Country_of_Travel, Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, Contextual_Information, Delay_in_Publication_Indicator, Record_ID, Dispute_Status_for_Publication, Product_Indicator, Name_of_Associated_Covered_Drug_or_Biological1, Name_of_Associated_Covered_Drug_or_Biological2, Name_of_Associated_Covered_Drug_or_Biological3, Name_of_Associated_Covered_Drug_or_Biological4, Name_of_Associated_Covered_Drug_or_Biological5, NDC_of_Associated_Covered_Drug_or_Biological1, NDC_of_Associated_Covered_Drug_or_Biological2,
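The failure above can be reduced to a small sketch. This is not Spark's actual CSVHeaderChecker, just a hedged illustration of the check that fires: the tokens of the line believed to be the header are compared, position by position, with the schema's column names. The symptom reported here is that with adaptive execution on, a data line from the middle of the file ends up being validated as if it were the header.

```python
# Hedged sketch (not Spark's actual CSVHeaderChecker) of the check that
# produced the error above: compare the supposed header tokens against the
# schema's column names and raise on any mismatch.

def check_header(schema_fields, header_tokens):
    if list(schema_fields) != list(header_tokens):
        raise ValueError(
            "CSV header does not conform to the schema. Header: "
            + ", ".join(header_tokens)
        )

schema = ["Change_Type", "Covered_Recipient_Type"]

check_header(schema, ["Change_Type", "Covered_Recipient_Type"])  # real header: passes

try:
    # A data line mistaken for the header fails the check, as in the log above.
    check_header(schema, ["UNCHANGED", "Covered Recipient Physician"])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```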
[jira] [Commented] (SPARK-26499) JdbcUtils.makeGetter does not handle ByteType
[ https://issues.apache.org/jira/browse/SPARK-26499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974757#comment-16974757 ] Dongjoon Hyun commented on SPARK-26499: --- This is backported to branch-2.4 via https://github.com/apache/spark/pull/26531 > JdbcUtils.makeGetter does not handle ByteType > - > > Key: SPARK-26499 > URL: https://issues.apache.org/jira/browse/SPARK-26499 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas D'Silva >Assignee: Thomas D'Silva >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I am trying to use the DataSource V2 API to read from a JDBC source. While > using {{JdbcUtils.resultSetToSparkInternalRows}} to create an internal row > from a ResultSet that has a column of type TINYINT I ran into the following > exception > {code:java} > java.lang.IllegalArgumentException: Unsupported type tinyint > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter(JdbcUtils.scala:502) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters(JdbcUtils.scala:379) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.<init>(JdbcUtils.scala:340) > {code} > This happens because ByteType is not handled in {{JdbcUtils.makeGetter}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
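A hedged sketch (in Python, with an illustrative dispatch table rather than Spark's Scala pattern match) of why the exception is thrown and what the fix looks like: makeGetter maps each column's data type to a reader function, and a type missing from that mapping surfaces as "Unsupported type tinyint"; the fix is to register a getter for ByteType.

```python
# Illustrative dispatch-table version of JdbcUtils.makeGetter: look up the
# reader function for a Catalyst data type, failing for unregistered types.

def make_getter(data_type, getters):
    try:
        return getters[data_type]
    except KeyError:
        raise ValueError(f"Unsupported type {data_type}")

getters = {
    "int": lambda row, i: int(row[i]),
    "long": lambda row, i: int(row[i]),
    # Before the fix there is no entry for "tinyint" (ByteType).
}

try:
    make_getter("tinyint", getters)
    unsupported = False
except ValueError:
    unsupported = True
print(unsupported)  # True

# After the fix: register a ByteType getter and the lookup succeeds.
getters["tinyint"] = lambda row, i: int(row[i]) & 0xFF
print(make_getter("tinyint", getters)(["7"], 0))  # 7
```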
[jira] [Updated] (SPARK-26499) JdbcUtils.makeGetter does not handle ByteType
[ https://issues.apache.org/jira/browse/SPARK-26499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26499: -- Fix Version/s: 2.4.5 > JdbcUtils.makeGetter does not handle ByteType > - > > Key: SPARK-26499 > URL: https://issues.apache.org/jira/browse/SPARK-26499 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas D'Silva >Assignee: Thomas D'Silva >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I am trying to use the DataSource V2 API to read from a JDBC source. While > using {{JdbcUtils.resultSetToSparkInternalRows}} to create an internal row > from a ResultSet that has a column of type TINYINT I ran into the following > exception > {code:java} > java.lang.IllegalArgumentException: Unsupported type tinyint > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter(JdbcUtils.scala:502) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters(JdbcUtils.scala:379) > at > 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.<init>(JdbcUtils.scala:340) > {code} > This happens because ByteType is not handled in {{JdbcUtils.makeGetter}}.
[jira] [Resolved] (SPARK-28602) Recognize interval as a numeric type
[ https://issues.apache.org/jira/browse/SPARK-28602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-28602. -- Fix Version/s: 3.0.0 Target Version/s: 3.0.0 Resolution: Duplicate > Recognize interval as a numeric type > > > Key: SPARK-28602 > URL: https://issues.apache.org/jira/browse/SPARK-28602 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > Fix For: 3.0.0 > > > Hello, > Spark does not recognize `interval` type as a `numeric` one, which means that > we can't use `interval` columns in aggregated functions. For instance, the > following query works on PgSQL but does not work on Spark: > {code:sql}SELECT i,AVG(cast(v as interval)) OVER (ORDER BY i ROWS BETWEEN > CURRENT ROW AND UNBOUNDED FOLLOWING) FROM (VALUES(1,'1 sec'),(2,'2 > sec'),(3,NULL),(4,NULL)) t(i,v);{code} > {code:sql}cannot resolve 'avg(CAST(`v` AS INTERVAL))' due to data type > mismatch: function average requires numeric types, not interval; line 1 pos > 9{code}
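The averaging the reporter asks for is well defined as soon as a type supports addition and division by a count, which is all AVG needs; Python's timedelta illustrates this outside of Spark (illustrative only, not Spark's interval implementation):

```python
# Averaging interval-like values with NULLs skipped, mirroring the
# PgSQL query in the ticket: AVG over ('1 sec', '2 sec', NULL, NULL).
from datetime import timedelta

values = [timedelta(seconds=1), timedelta(seconds=2), None, None]
present = [v for v in values if v is not None]   # AVG ignores NULLs
avg = sum(present, timedelta(0)) / len(present)  # add, then divide by count
print(avg)  # 0:00:01.500000
```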
[jira] [Resolved] (SPARK-29889) unify the interval tests
[ https://issues.apache.org/jira/browse/SPARK-29889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29889. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26515 [https://github.com/apache/spark/pull/26515] > unify the interval tests > > > Key: SPARK-29889 > URL: https://issues.apache.org/jira/browse/SPARK-29889 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-29901) Fix broken links in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-29901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29901. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26528 [https://github.com/apache/spark/pull/26528] > Fix broken links in SQL Reference > - > > Key: SPARK-29901 > URL: https://issues.apache.org/jira/browse/SPARK-29901 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Fix the broken links
[jira] [Created] (SPARK-29905) ExecutorPodsLifecycleManager has sub-optimal behavior with dynamic allocation
Marcelo Masiero Vanzin created SPARK-29905: -- Summary: ExecutorPodsLifecycleManager has sub-optimal behavior with dynamic allocation Key: SPARK-29905 URL: https://issues.apache.org/jira/browse/SPARK-29905 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.0 Reporter: Marcelo Masiero Vanzin I've been playing with dynamic allocation on k8s and noticed some weird behavior from ExecutorPodsLifecycleManager when it's on. The cause of this behavior is mostly because of the higher rate of pod updates when you have dynamic allocation. Pods being created and going away all the time generate lots of events, that are then translated into "snapshots" internally in Spark, and fed to subscribers such as ExecutorPodsLifecycleManager. The first effect of that is that you get a lot of spurious logging. Since snapshots are incremental, you can get lots of snapshots with the same "PodDeleted" information, for example, and ExecutorPodsLifecycleManager will log for all of them. Yes, log messages are at debug level, but if you're debugging that stuff, it's really noisy and distracting. The second effect is that the same way you get multiple log messages, you end up calling into the Spark scheduler, and worse, into the K8S API server, multiple times for the same pod update. We can optimize that and reduce the chattiness with the API server.
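The proposed optimization can be sketched as simple per-pod deduplication (names here are illustrative, not Spark's actual classes): remember the last state acted on for each pod and skip snapshots that carry no new information, so the scheduler and the K8S API server are contacted once per real transition.

```python
# Hedged sketch of deduplicating incremental pod snapshots: only act (log,
# call the scheduler, call the API server) when a pod's state actually changes.

def process_snapshots(snapshots):
    last_seen = {}   # pod name -> last state acted upon
    actions = []     # calls we would make downstream
    for snapshot in snapshots:
        for pod, state in snapshot.items():
            if last_seen.get(pod) != state:
                last_seen[pod] = state
                actions.append((pod, state))
    return actions

# Incremental snapshots repeat the same "deleted" event for exec-1.
snapshots = [
    {"exec-1": "running"},
    {"exec-1": "deleted"},
    {"exec-1": "deleted"},   # spurious repeat: no action, no log line
]
print(process_snapshots(snapshots))  # [('exec-1', 'running'), ('exec-1', 'deleted')]
```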
[jira] [Comment Edited] (SPARK-25185) CBO rowcount statistics doesn't work for partitioned parquet external table
[ https://issues.apache.org/jira/browse/SPARK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974724#comment-16974724 ] venkata yerubandi edited comment on SPARK-25185 at 11/15/19 1:21 AM: - Is there any update on this issue? We are facing the same issue in Spark 2.4.0 was (Author: raoyvn): Is there any update on this issue ? we are facing the same issue > CBO rowcount statistics doesn't work for partitioned parquet external table > --- > > Key: SPARK-25185 > URL: https://issues.apache.org/jira/browse/SPARK-25185 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.2.1, 2.3.0 > Environment: > Tried on Ubuntu, FreBSD and windows, running spark-shell in local mode > reading data from local file system >Reporter: Amit >Priority: Major > > Created a dummy partitioned data with partition column on string type col1=a > and col1=b > added csv data-> read through spark -> created partitioned external table-> > msck repair table to load partition. Did analyze on all columns and partition > column as well. > ~println(spark.sql("select * from test_p where > e='1a'").queryExecution.toStringWithStats)~ > ~val op = spark.sql("select * from test_p where > e='1a'").queryExecution.optimizedPlan~ > // e is the partitioned column > ~val stat = op.stats(spark.sessionState.conf)~ > ~print(stat.rowCount)~ > > Created the same way in parquet the rowcount comes up correctly in case of > csv but in parquet it shows as None.
[jira] [Commented] (SPARK-25185) CBO rowcount statistics doesn't work for partitioned parquet external table
[ https://issues.apache.org/jira/browse/SPARK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974724#comment-16974724 ] venkata yerubandi commented on SPARK-25185: --- Is there any update on this issue? We are facing the same issue > CBO rowcount statistics doesn't work for partitioned parquet external table > --- > > Key: SPARK-25185 > URL: https://issues.apache.org/jira/browse/SPARK-25185 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.2.1, 2.3.0 > Environment: > Tried on Ubuntu, FreBSD and windows, running spark-shell in local mode > reading data from local file system >Reporter: Amit >Priority: Major > > Created a dummy partitioned data with partition column on string type col1=a > and col1=b > added csv data-> read through spark -> created partitioned external table-> > msck repair table to load partition. Did analyze on all columns and partition > column as well. > ~println(spark.sql("select * from test_p where > e='1a'").queryExecution.toStringWithStats)~ > ~val op = spark.sql("select * from test_p where > e='1a'").queryExecution.optimizedPlan~ > // e is the partitioned column > ~val stat = op.stats(spark.sessionState.conf)~ > ~print(stat.rowCount)~ > > Created the same way in parquet the rowcount comes up correctly in case of > csv but in parquet it shows as None.
[jira] [Resolved] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
[ https://issues.apache.org/jira/browse/SPARK-29857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29857. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26482 [https://github.com/apache/spark/pull/26482] > [WEB UI] Support defer render the spark history summary page. > -- > > Key: SPARK-29857 > URL: https://issues.apache.org/jira/browse/SPARK-29857 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Fix For: 3.0.0 > > > When there are many applications in the spark history server, rendering of the > history summary page is heavy; we can enable deferRender to tune it. > See details https://datatables.net/examples/ajax/defer_render.html
[jira] [Assigned] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
[ https://issues.apache.org/jira/browse/SPARK-29857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29857: Assignee: feiwang > [WEB UI] Support defer render the spark history summary page. > -- > > Key: SPARK-29857 > URL: https://issues.apache.org/jira/browse/SPARK-29857 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > > When there are many applications in the spark history server, rendering of the > history summary page is heavy; we can enable deferRender to tune it. > See details https://datatables.net/examples/ajax/defer_render.html
[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974678#comment-16974678 ] Bryan Cutler commented on SPARK-29748: -- Thanks for discussing [~zero323] . The goal here is to only remove the sorting of fields, which causes all kinds of weird inconsistencies like in your above example. I'd prefer to leave efficient field access for another time. Since Row is a subclass of tuple, accessing fields by name has never been efficient and I don't want to change the fundamental design here. The only reason to introduce LegacyRow (which will be deprecated) is to maintain backward compatibility with existing code that expects fields to be sorted. > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the Fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. 
Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc). If kwargs are used, an error will be raised, or, > based on a conf setting, it can fall back to a LegacyRow that will sort the > fields as before. This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped.
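The sorting inconsistency is easy to demonstrate in plain Python, without Spark (a hedged sketch; PySpark's real Row is a tuple subclass with more machinery):

```python
# Legacy behavior: kwargs sorted alphabetically, so field order no longer
# matches what the user wrote and positional schema application misfires.
def legacy_row(**kwargs):
    return tuple(kwargs[k] for k in sorted(kwargs))

# Proposed behavior: keep fields in the order entered, which Python 3.6+
# guarantees for keyword arguments.
def new_row(**kwargs):
    return tuple(kwargs.values())

# User writes name first, age second; schema is (name: string, age: int).
print(legacy_row(name="Alice", age=1))  # (1, 'Alice') -- surprising order
print(new_row(name="Alice", age=1))     # ('Alice', 1) -- as entered
```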
[jira] [Assigned] (SPARK-29865) k8s executor pods all have different prefixes in client mode
[ https://issues.apache.org/jira/browse/SPARK-29865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson reassigned SPARK-29865: -- Assignee: Marcelo Masiero Vanzin > k8s executor pods all have different prefixes in client mode > > > Key: SPARK-29865 > URL: https://issues.apache.org/jira/browse/SPARK-29865 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Minor > Fix For: 3.0.0 > > > This works in cluster mode since the features set things up so that all > executor pods have the same name prefix. > But in client mode features are not used; so each executor ends up with a > different name prefix, which makes debugging a little bit annoying.
[jira] [Resolved] (SPARK-29865) k8s executor pods all have different prefixes in client mode
[ https://issues.apache.org/jira/browse/SPARK-29865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson resolved SPARK-29865. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26488 [https://github.com/apache/spark/pull/26488] > k8s executor pods all have different prefixes in client mode > > > Key: SPARK-29865 > URL: https://issues.apache.org/jira/browse/SPARK-29865 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Priority: Minor > Fix For: 3.0.0 > > > This works in cluster mode since the features set things up so that all > executor pods have the same name prefix. > But in client mode features are not used; so each executor ends up with a > different name prefix, which makes debugging a little bit annoying.
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974655#comment-16974655 ] Terry Kim commented on SPARK-29900: --- If we make the relation lookup behavior consistent such that 1) temp views are resolved first 2) then tables are resolved, [~brkyvz], for your example, {code} // Create temporary view 't' spark.sql("create temporary view t as select 2 as i"); // BREAKING CHANGE: currently, the following is allowed. // But with the new resolution behavior, this should not be allowed (same as the postgresql behavior) spark.range(0, 5).write.saveAsTable("t") // you should be able to qualify the table name to make it work. spark.range(0, 5).write.saveAsTable("default.t") {code} For the DROP behavior: {code} spark.sql("show tables").show ++-+---+ |database|tableName|isTemporary| ++-+---+ | default|t| false| ||t| true| ++-+---+ // BREAKING CHANGE: currently, the following is allowed and drops the view. // But it should say '"t" is not a table'. spark.sql("drop table t") {code} [~rdblue], yes, this will be a breaking change. [~cloud_fan] is this in line with what you were thinking? > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. 
It's only useful when there are a temp view and a table > with the same name, but users can easily use a qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code}
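The proposed consistent rule can be sketched in a few lines (illustrative only, not Spark's analyzer code): an unqualified name resolves against temp views first, then tables; a qualified name such as "default.t" bypasses temp views entirely, which is how users disambiguate.

```python
# Hedged sketch of the consistent relation-lookup rule discussed above.
def resolve(name, temp_views, tables):
    if "." in name:                      # qualified name: db.table
        db, table = name.split(".", 1)
        return tables.get((db, table))   # never resolves to a temp view
    if name in temp_views:               # unqualified: temp view wins
        return temp_views[name]
    return tables.get(("default", name))

temp_views = {"t": "temp view t"}
tables = {("default", "t"): "table default.t"}

print(resolve("t", temp_views, tables))          # temp view t
print(resolve("default.t", temp_views, tables))  # table default.t
```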
[jira] [Comment Edited] (SPARK-29884) spark-submit to kuberentes can not parse valid ca certificate
[ https://issues.apache.org/jira/browse/SPARK-29884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974621#comment-16974621 ] Jeremy edited comment on SPARK-29884 at 11/14/19 9:34 PM: -- After doing some debugging it seems like this might be in the fabric8 k8s client. It tries to use .kube/config even if it gets all the parameters it needs from arguments. was (Author: jeremyjjbrown): After doing some debugging it seams like this might be in fabric k8s client. I tries to use .kube/config even if it gets all the parameters is needs from arguments. > spark-submit to kuberentes can not parse valid ca certificate > - > > Key: SPARK-29884 > URL: https://issues.apache.org/jira/browse/SPARK-29884 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.4 > Environment: A kubernetes cluster that has been in use for over 2 > years and handles large amounts of production payloads. >Reporter: Jeremy >Priority: Major > > spark-submit can not be used to schedule to kubernetes with an oauth token > and cacert > {code:java} > spark-submit \ > --deploy-mode cluster \ > --class org.apache.spark.examples.SparkPi \ > --master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \ > --conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf > spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt > \ > --conf spark.kubernetes.namespace=here-olp-3dds-sit \ > --conf spark.executor.instances=1 \ > --conf spark.app.name=spark-pi \ > --conf > spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 > \ > --conf > spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 > \ > local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar > {code} > returns > {code:java} > log4j:WARN No appenders could be found for logger >
(io.fabric8.kubernetes.client.Config). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) > at > io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183) > at > org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.security.cert.CertificateException: Could not parse > certificate: java.io.IOException: Empty input > at > 
sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110) > at > java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339) > at > io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104) > at > io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122) > at >
[jira] [Commented] (SPARK-29884) spark-submit to kubernetes cannot parse valid ca certificate
[ https://issues.apache.org/jira/browse/SPARK-29884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974621#comment-16974621 ] Jeremy commented on SPARK-29884: After doing some debugging it seems like this might be in the fabric8 k8s client. It tries to use .kube/config even if it gets all the parameters it needs from arguments. > spark-submit to kubernetes cannot parse valid ca certificate > - > > Key: SPARK-29884 > URL: https://issues.apache.org/jira/browse/SPARK-29884 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.4 > Environment: A kubernetes cluster that has been in use for over 2 > years and handles large amounts of production payloads. >Reporter: Jeremy >Priority: Major > > spark-submit cannot be used to schedule to kubernetes with an oauth token > and cacert > {code:java} > spark-submit \ > --deploy-mode cluster \ > --class org.apache.spark.examples.SparkPi \ > --master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \ > --conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf > spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt > \ > --conf spark.kubernetes.namespace=here-olp-3dds-sit \ > --conf spark.executor.instances=1 \ > --conf spark.app.name=spark-pi \ > --conf > spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 > \ > --conf > spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 > \ > local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar > {code} > returns > {code:java} > log4j:WARN No appenders could be found for logger > (io.fabric8.kubernetes.client.Config). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. 
> Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) > at > io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183) > at > org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.security.cert.CertificateException: Could not parse > certificate: java.io.IOException: Empty input > at > sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110) > at > java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339) > at > 
io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104) > at > io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122) > at > io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:78) > ... 13 more > Caused by: java.io.IOException: Empty input > at > sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106) > ... 19 more > {code} > The cacert and
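The root cause in the trace above is `java.io.IOException: Empty input` thrown while the fabric8 client builds its keystore, which is what the JDK raises when the certificate data it actually reads is zero-length. As a rough pre-flight illustration (a hypothetical helper, not part of Spark or fabric8), one can verify that a caCertFile at least looks like non-empty PEM before invoking spark-submit:

```python
import os

def looks_like_pem_cert(path):
    """Hypothetical pre-flight check for a CA cert file: it must exist,
    be non-empty, and contain a PEM certificate header -- otherwise the
    JDK-side parse fails the same way as above ('Empty input')."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        return b"-----BEGIN CERTIFICATE-----" in f.read()
```

If this check passes for the file on the command line yet the submission still fails with "Empty input", the client is likely reading a different file than the one passed, e.g. falling back to .kube/config as the comments above suggest.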
[jira] [Updated] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-28833: - Priority: Minor (was: Major) > Document ALTER VIEW Statement in SQL Reference. > --- > > Key: SPARK-28833 > URL: https://issues.apache.org/jira/browse/SPARK-28833 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Assignee: kevin yu >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-28833. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25573 [https://github.com/apache/spark/pull/25573] > Document ALTER VIEW Statement in SQL Reference. > --- > > Key: SPARK-28833 > URL: https://issues.apache.org/jira/browse/SPARK-28833 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Assignee: kevin yu >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-28833: Assignee: kevin yu > Document ALTER VIEW Statement in SQL Reference. > --- > > Key: SPARK-28833 > URL: https://issues.apache.org/jira/browse/SPARK-28833 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Assignee: kevin yu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29904) Parse timestamps in microsecond precision by JSON/CSV datasources
Maxim Gekk created SPARK-29904: -- Summary: Parse timestamps in microsecond precision by JSON/CSV datasources Key: SPARK-29904 URL: https://issues.apache.org/jira/browse/SPARK-29904 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Maxim Gekk Currently, Spark can parse strings with timestamps from JSON/CSV in millisecond precision. Internally, timestamps have microsecond precision. This ticket aims to modify the parsing logic in Spark 2.4 to support microsecond precision. Porting DateFormatter/TimestampFormatter from Spark 3.0-preview is risky, so a lighter solution is needed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
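The precision gap the ticket describes can be illustrated with plain Python (a sketch of the concept, not Spark's parsing code): a format with six fractional digits keeps microseconds, while a millisecond-only parser effectively truncates the last three digits.

```python
from datetime import datetime

# %f parses up to six fractional digits, i.e. microsecond precision.
parsed = datetime.strptime("2019-11-14 12:34:56.123456", "%Y-%m-%d %H:%M:%S.%f")
micros = parsed.microsecond                 # 123456
millis_truncated = (micros // 1000) * 1000  # 123000 -- what a millisecond-only
                                            # parser would retain
```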
[jira] [Commented] (SPARK-29903) Add documentation for recursiveFileLookup
[ https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974577#comment-16974577 ] Nicholas Chammas commented on SPARK-29903: -- cc [~cloud_fan] and [~weichenxu123] > Add documentation for recursiveFileLookup > - > > Key: SPARK-29903 > URL: https://issues.apache.org/jira/browse/SPARK-29903 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively > loading data from a source directory. There is currently no documentation for > this option. > We should document this both for the DataFrame API as well as for SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29903) Add documentation for recursiveFileLookup
Nicholas Chammas created SPARK-29903: Summary: Add documentation for recursiveFileLookup Key: SPARK-29903 URL: https://issues.apache.org/jira/browse/SPARK-29903 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0 Reporter: Nicholas Chammas SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively loading data from a source directory. There is currently no documentation for this option. We should document this both for the DataFrame API as well as for SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
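In the DataFrame API the option is passed as a read option, e.g. `spark.read.format("parquet").option("recursiveFileLookup", "true").load(path)`. The behavior it toggles can be sketched with a stdlib analogy (a hypothetical helper, not Spark code): a flat listing sees only top-level files, while a recursive listing descends into subdirectories.

```python
import pathlib

def list_files(root, recursive=False):
    # Hypothetical analogy to the option's effect: recursive=False sees
    # only top-level files; recursive=True also picks up files nested in
    # subdirectories of the source path.
    root = pathlib.Path(root)
    pattern = "**/*" if recursive else "*"
    return sorted(p.name for p in root.glob(pattern) if p.is_file())
```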
[jira] [Created] (SPARK-29902) Add listener event queue capacity configuration to documentation
shahid created SPARK-29902: -- Summary: Add listener event queue capacity configuration to documentation Key: SPARK-29902 URL: https://issues.apache.org/jira/browse/SPARK-29902 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0 Reporter: shahid Add listener event queue capacity configuration to documentation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974527#comment-16974527 ] John Bauer commented on SPARK-29691: [[SPARK-29691] ensure Param objects are valid in fit, transform|https://github.com/apache/spark/pull/26527] > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method is supposed to copy a dictionary of params, > overwriting the estimator's previous values, before fitting the model. > However, the parameter values are not updated. This was observed in PySpark, > but may be present in the Java objects, as the PySpark code appears to be > functioning correctly. (The copy method that interacts with Java is > actually implemented in Params.) > For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
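Note that the snippet above passes a plain string key ("elasticNetParam") where PySpark's copy machinery works with Param objects, which may be why the override is silently dropped; the linked PR adds validation. The copy-with-overrides contract itself can be sketched in plain Python (a hypothetical stand-in, not PySpark's actual Params implementation): the fit applies the extra params as overrides on a copy, so the fitted result should see 0.75.

```python
class SketchEstimator:
    """Hypothetical stand-in for pyspark.ml's Estimator/Params pattern;
    not the real implementation."""

    def __init__(self, **params):
        self.params = dict(params)

    def copy(self, extra=None):
        # Extra params override this estimator's values on the copy.
        merged = dict(self.params)
        merged.update(extra or {})
        return SketchEstimator(**merged)

    def fit(self, dataset, params=None):
        # Fit using a copy that has the overrides applied; the original
        # estimator's own params are left untouched.
        fitted = self.copy(params)
        return fitted.params  # stand-in for the fitted model's params

lr = SketchEstimator(maxIter=10, regParam=0.3, elasticNetParam=0.8)
model_params = lr.fit(None, params={"elasticNetParam": 0.75})
# model_params sees 0.75; lr itself still holds 0.8
```

Whether the original estimator should also reflect the override (as the "After:" line in the report expects) is part of what the ticket is sorting out; the sketch only shows the copy semantics.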
[jira] [Resolved] (SPARK-29672) update spark testing framework to use python3
[ https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp resolved SPARK-29672. - Resolution: Fixed > update spark testing framework to use python3 > - > > Key: SPARK-29672 > URL: https://issues.apache.org/jira/browse/SPARK-29672 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] > it's time, at least for 3.0+ to migrate the test execution framework to > python 3.6. > this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. > after 3.0 is cut, we can then add python3.8 and drop python2.7 and pypy2.5.1 > test support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29672) update spark testing framework to use python3
[ https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29672: Description: python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] it's time, at least for 3.0+ to migrate the test execution framework to python 3.6. this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. after 3.0 is cut, we can then add python3.8 and drop python2.7 and pypy2.5.1 test support. was: python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] it's time, at least for 3.0+ to remove python 2.7 test support and migrate the test execution framework to python 3.6. this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. > update spark testing framework to use python3 > - > > Key: SPARK-29672 > URL: https://issues.apache.org/jira/browse/SPARK-29672 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] > it's time, at least for 3.0+ to migrate the test execution framework to > python 3.6. > this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. > after 3.0 is cut, we can then add python3.8 and drop python2.7 and pypy2.5.1 > test support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29672) update spark testing framework to use python3
[ https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29672: Summary: update spark testing framework to use python3 (was: remove python2 tests and test infra) > update spark testing framework to use python3 > - > > Key: SPARK-29672 > URL: https://issues.apache.org/jira/browse/SPARK-29672 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] > it's time, at least for 3.0+ to remove python 2.7 test support and migrate > the test execution framework to python 3.6. > this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29901) Fix broken links in SQL Reference
Huaxin Gao created SPARK-29901: -- Summary: Fix broken links in SQL Reference Key: SPARK-29901 URL: https://issues.apache.org/jira/browse/SPARK-29901 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao Fix the broken links -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974456#comment-16974456 ] Ryan Blue commented on SPARK-29900: --- To be clear, we think this is going to be a breaking change, right? > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 row) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
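The two lookup behaviors the ticket contrasts can be sketched as a two-tier dictionary lookup (illustrative Python only, not Spark's analyzer): behavior 1 consults the temp-view registry before the catalog for unqualified names, behavior 2 goes straight to the catalog, and a qualified name bypasses temp views in both.

```python
# Illustrative registries; names and values are made up for the sketch.
temp_views = {"t": "temp view t"}
catalog = {"t": "table s1.t (persistent)", "s1.t": "table s1.t (persistent)"}

def resolve(name, check_temp_first=True):
    # Behavior 1 (SELECT/INSERT/DESC TABLE): unqualified names hit temp
    # views first.  Behavior 2 (most commands): straight to the catalog.
    if check_temp_first and "." not in name and name in temp_views:
        return temp_views[name]
    return catalog[name]

resolve("t")                          # behavior 1 -> the temp view
resolve("t", check_temp_first=False)  # behavior 2 -> the persistent table
resolve("s1.t")                       # a qualified name disambiguates
```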
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974436#comment-16974436 ] Burak Yavuz commented on SPARK-29900: - I definitely agree the behavior is very confusing here. (For example, you can saveAsTable into a table, while a temp table with the same name exists... Once you query the table, you get the temp table back). Can we post here the proposed behavior? > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974433#comment-16974433 ] Terry Kim commented on SPARK-29900: --- Yes. Thanks [~cloud_fan] > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29899) Can not recursively lookup files in Hive table via SQL
[ https://issues.apache.org/jira/browse/SPARK-29899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29899: --- Description: SPARK-27990 provides a way to recursively load data from a datasource. In SQL, when querying a Hive table, this property is passed via `relation.tableMeta.properties`, but it is now filtered out, so we cannot look up files recursively for a Hive table. (was: SPARK-27990 provide a way to recursively load data from datasource. In SQL, this property passed by the `relation.tableMeta.properties`. But in Parquet file format, it is filtered out. So we can not lookup file recursively for a table.) > Can not recursively lookup files in Hive table via SQL > -- > > Key: SPARK-29899 > URL: https://issues.apache.org/jira/browse/SPARK-29899 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > SPARK-27990 provides a way to recursively load data from a datasource. In SQL, > when querying a Hive table, this property is passed via > `relation.tableMeta.properties`, but it is now filtered out, so we cannot > look up files recursively for a Hive table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
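The failure mode described can be sketched in a few lines (illustrative only: the recursiveFileLookup key name is real, but the whitelist filter below is a hypothetical stand-in for Spark's internal property filtering): when table properties pass through a filter that forwards only recognized keys, the lookup option never reaches the file listing.

```python
# Table properties as they might arrive via relation.tableMeta.properties
# (values and the filter are made up for the sketch).
table_properties = {
    "recursiveFileLookup": "true",
    "serialization.format": "1",
}

# A whitelist-style filter that forwards only known datasource options
# drops recursiveFileLookup on the floor -- the bug the ticket describes.
FORWARDED_KEYS = {"serialization.format"}
options = {k: v for k, v in table_properties.items() if k in FORWARDED_KEYS}
# options no longer contains "recursiveFileLookup"
```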
[jira] [Updated] (SPARK-29899) Can not recursively lookup files in Hive table via SQL
[ https://issues.apache.org/jira/browse/SPARK-29899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29899: --- Summary: Can not recursively lookup files in Hive table via SQL (was: Can not set recursiveFileLookup property in SQL) > Can not recursively lookup files in Hive table via SQL > -- > > Key: SPARK-29899 > URL: https://issues.apache.org/jira/browse/SPARK-29899 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > SPARK-27990 provide a way to recursively load data from datasource. In SQL, > this property passed by the `relation.tableMeta.properties`. But in Parquet > file format, it is filtered out. So we can not lookup file recursively for a > table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29899) Can not set recursiveFileLookup property in SQL
[ https://issues.apache.org/jira/browse/SPARK-29899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29899: --- Summary: Can not set recursiveFileLookup property in SQL (was: Can not set recursiveFileLookup property in TBLPROPERTIES if file format is Parquet) > Can not set recursiveFileLookup property in SQL > --- > > Key: SPARK-29899 > URL: https://issues.apache.org/jira/browse/SPARK-29899 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > SPARK-27990 provide a way to recursively load data from datasource. In SQL, > this property passed by the `relation.tableMeta.properties`. But in Parquet > file format, it is filtered out. So we can not lookup file recursively for a > table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974352#comment-16974352 ] Wenchen Fan commented on SPARK-29900: - [~imback82] do you want to drive it? also cc [~rdblue] [~brkyvz] [~dongjoon] > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29900: Description: Currently, Spark has 2 different relation resolution behaviors: 1. try to look up the temp view first, then the table/persistent view. 2. try to look up the table/persistent view only. The first behavior is used in SELECT, INSERT and a few commands that support views, like DESC TABLE. The second behavior is used in most commands. It's confusing to have inconsistent relation resolution behaviors, and the benefit is super small. It's only useful when there are a temp view and a table with the same name, but users can easily use a qualified table name to disambiguate. In Postgres, the relation resolution behavior is consistent: {code} cloud0fan=# create schema s1; CREATE SCHEMA cloud0fan=# SET search_path TO s1; SET cloud0fan=# create table s1.t (i int); CREATE TABLE cloud0fan=# insert into s1.t values (1); INSERT 0 1 # access table with qualified name cloud0fan=# select * from s1.t; i --- 1 (1 row) # access table with single name cloud0fan=# select * from t; i --- 1 (1 row) # create a temp view with conflicting name cloud0fan=# create temp view t as select 2 as i; CREATE VIEW # same as spark, temp view has higher priority during resolution cloud0fan=# select * from t; i --- 2 (1 row) # DROP TABLE also resolves temp view first cloud0fan=# drop table t; ERROR: "t" is not a table # DELETE also resolves temp view first cloud0fan=# delete from t where i = 0; ERROR: cannot delete from view "t" {code} was: Currently, Spark has 2 different relation resolution behaviors: 1. try to look up the temp view first, then the table/persistent view. 2. try to look up the table/persistent view only. The first behavior is used in SELECT, INSERT and a few commands that support views, like DESC TABLE. The second behavior is used in most commands. It's confusing to have inconsistent relation resolution behaviors, and the benefit is super small.
It's only useful when there are a temp view and a table with the same name, but users can easily use a qualified table name to disambiguate. In Postgres, the relation resolution behavior is consistent: {code} cloud0fan=# create schema s1; CREATE SCHEMA cloud0fan=# SET search_path TO s1; SET cloud0fan=# create table s1.t (i int); CREATE TABLE cloud0fan=# insert into s1.t values (1); INSERT 0 1 # access table with qualified name cloud0fan=# select * from s1.t; i --- 1 (1 row) # access table with single name cloud0fan=# select * from t; i --- 1 (1 row) # create a temp view with conflicting name cloud0fan=# create temp view t as select 2 as i; CREATE VIEW # same as spark, temp view has higher priority during resolution cloud0fan=# select * from t; i --- 2 (1 row) # DROP TABLE also resolves temp view first cloud0fan=# drop table t; ERROR: "t" is not a table # DELETE also resolves temp view first cloud0fan=# delete from t where i = 0; ERROR: cannot delete from view "t" {code} > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up the temp view first, then the table/persistent view. > 2. try to look up the table/persistent view only. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are a temp view and a table > with the same name, but users can easily use a qualified table name to > disambiguate.
> In Postgres, the relation resolution behavior is consistent: > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 row) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code}
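The two lookup behaviors described in the ticket can be illustrated with a toy resolver. This is only a sketch: `temp_views` and `catalog` are hypothetical stand-ins, not Spark internals.

```python
# Toy sketch of Spark's two relation-resolution behaviors (illustrative only).
temp_views = {"t": "temp view: SELECT 2 AS i"}
catalog = {("s1", "t"): "table s1.t"}

def resolve_view_aware(name, current_db="s1"):
    """Behavior 1 (SELECT/INSERT/DESC TABLE): the temp view shadows the table."""
    if name in temp_views:
        return temp_views[name]
    return catalog.get((current_db, name))

def resolve_table_only(name, current_db="s1"):
    """Behavior 2 (most commands): go straight to the catalog."""
    return catalog.get((current_db, name))

def resolve_qualified(db, name):
    """A qualified name always bypasses temp views, so it disambiguates."""
    return catalog.get((db, name))
```

Here `resolve_view_aware("t")` returns the temp view while `resolve_table_only("t")` returns the table, which is exactly the inconsistency the ticket proposes to remove.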
[jira] [Created] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
Wenchen Fan created SPARK-29900: --- Summary: make relation lookup behavior consistent within Spark SQL Key: SPARK-29900 URL: https://issues.apache.org/jira/browse/SPARK-29900 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Currently, Spark has 2 different relation resolution behaviors: 1. try to look up the temp view first, then the table/persistent view. 2. try to look up the table/persistent view only. The first behavior is used in SELECT, INSERT and a few commands that support views, like DESC TABLE. The second behavior is used in most commands. It's confusing to have inconsistent relation resolution behaviors, and the benefit is super small. It's only useful when there are a temp view and a table with the same name, but users can easily use a qualified table name to disambiguate. In Postgres, the relation resolution behavior is consistent: {code} cloud0fan=# create schema s1; CREATE SCHEMA cloud0fan=# SET search_path TO s1; SET cloud0fan=# create table s1.t (i int); CREATE TABLE cloud0fan=# insert into s1.t values (1); INSERT 0 1 # access table with qualified name cloud0fan=# select * from s1.t; i --- 1 (1 row) # access table with single name cloud0fan=# select * from t; i --- 1 (1 row) # create a temp view with conflicting name cloud0fan=# create temp view t as select 2 as i; CREATE VIEW # same as spark, temp view has higher priority during resolution cloud0fan=# select * from t; i --- 2 (1 row) # DROP TABLE also resolves temp view first cloud0fan=# drop table t; ERROR: "t" is not a table # DELETE also resolves temp view first cloud0fan=# delete from t where i = 0; ERROR: cannot delete from view "t" {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29899) Can not set recursiveFileLookup property in TBLPROPERTIES if file format is Parquet
Lantao Jin created SPARK-29899: -- Summary: Can not set recursiveFileLookup property in TBLPROPERTIES if file format is Parquet Key: SPARK-29899 URL: https://issues.apache.org/jira/browse/SPARK-29899 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin SPARK-27990 provides a way to recursively load data from a datasource. In SQL, this property is passed via `relation.tableMeta.properties`. But the Parquet file format filters it out, so we cannot look up files recursively for a table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
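The effect the ticket describes can be illustrated with plain directory listing: a flat listing misses files in nested directories, while a recursive lookup finds them. This is only a sketch of the concept, not Spark's file-index code.

```python
import os
import tempfile

def list_files(root, recursive=False):
    """Flat listing vs. recursive lookup, mirroring what the
    recursiveFileLookup option enables (illustrative only)."""
    if not recursive:
        return sorted(
            os.path.join(root, f) for f in os.listdir(root)
            if os.path.isfile(os.path.join(root, f))
        )
    found = []
    for dirpath, _dirs, files in os.walk(root):
        found.extend(os.path.join(dirpath, f) for f in files)
    return sorted(found)

# Demo: a nested layout that a flat listing misses.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "nested"))
open(os.path.join(root, "a.parquet"), "w").close()
open(os.path.join(root, "nested", "b.parquet"), "w").close()

flat = list_files(root)
deep = list_files(root, recursive=True)
```

With the property filtered out, a table backed by this layout would behave like `flat` (one file) instead of `deep` (both files).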
[jira] [Updated] (SPARK-29898) Support Avro Custom Logical Types
[ https://issues.apache.org/jira/browse/SPARK-29898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos del Prado Mota updated SPARK-29898: -- Description: Extends the options of the Spark Avro formatter to allow using custom Avro logical types. At the moment only timestamp and decimal logical types are supported in Spark, but Avro supports any conversion you could need. This change keeps the default mappings and allows adding new ones. {{spark}} {{ .read}} {{ .format("avro")}} {{ .option("logicalTypeMapper", "org.example.CustomAvroLogicalCatalystMapper")}} {{ .load()}} All you need to do is register your custom Avro logical type and then implement `AvroLogicalTypeCatalystMapper` was: Extends the options of the Spark Avro formatter to allow using custom Avro logical types. At the moment only timestamp and decimal logical types are supported in Spark, but Avro supports any conversion you could need. This change keeps the default mappings and allows adding new ones. {{spark}} {{ .read}} {{ .format("avro")}} {{ .option("logicalTypeMapper", "org.example.CustomAvroLogicalCatalystMapper")}} {{ .load()}} All you need to do is register your custom Avro logical type and then implement `AvroLogicalTypeCatalystMapper` > Support Avro Custom Logical Types > - > > Key: SPARK-29898 > URL: https://issues.apache.org/jira/browse/SPARK-29898 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Carlos del Prado Mota >Priority: Major > > Extends the options of the Spark Avro formatter to allow using custom Avro > logical types. > At the moment only timestamp and decimal logical types are supported in Spark, > but Avro supports any conversion you could need. This change keeps the > default mappings and allows adding new ones.
> {{spark}} > {{ .read}} > {{ .format("avro")}} > {{ .option("logicalTypeMapper", > "org.example.CustomAvroLogicalCatalystMapper")}} > {{ .load()}} > All you need to do is register your custom Avro logical type and then implement > `AvroLogicalTypeCatalystMapper` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
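The proposed pluggable mapper can be sketched as a conversion registry that keeps built-in mappings and accepts custom ones. Every name below (`register_mapper`, `convert`, the `money-cents` type) is hypothetical, not part of the proposed Spark API.

```python
# Sketch of a pluggable logical-type mapping registry, in the spirit of the
# proposed `logicalTypeMapper` option. All names here are hypothetical.
from datetime import date, timedelta

DEFAULT_MAPPERS = {
    # Built-in style conversion: Avro's date logical type stores
    # days since the Unix epoch.
    "date": lambda days: date(1970, 1, 1) + timedelta(days=days),
}

def register_mapper(logical_type, fn, registry=DEFAULT_MAPPERS):
    """Register a custom conversion while keeping the default mappings."""
    registry[logical_type] = fn

def convert(logical_type, raw, registry=DEFAULT_MAPPERS):
    """Apply the mapper for a logical type; unmapped values pass through."""
    fn = registry.get(logical_type)
    return fn(raw) if fn else raw

# A custom logical type, e.g. money stored as integer cents.
register_mapper("money-cents", lambda cents: cents / 100.0)
```

The registry pattern mirrors the ticket's intent: defaults stay intact, and new conversions are additive.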
[jira] [Created] (SPARK-29898) Support Avro Custom Logical Types
Carlos del Prado Mota created SPARK-29898: - Summary: Support Avro Custom Logical Types Key: SPARK-29898 URL: https://issues.apache.org/jira/browse/SPARK-29898 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Carlos del Prado Mota -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29897) Implicit cast to timestamp is failing
ABHISHEK KUMAR GUPTA created SPARK-29897: Summary: Implicit cast to timestamp is failing Key: SPARK-29897 URL: https://issues.apache.org/jira/browse/SPARK-29897 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA Spark cannot cast the string literal implicitly: jdbc:hive2://10.18.19.208:23040/default> SELECT EXTRACT(DAY FROM NOW() - '2014-08-02 08:10:56'); Error: org.apache.spark.sql.AnalysisException: cannot resolve '(current_timestamp() - CAST('2014-08-02 08:10:56' AS DOUBLE))' due to data type mismatch: differing types in '(current_timestamp() - CAST('2014-08-02 08:10:56' AS DOUBLE))' (timestamp and double).; line 1 pos 24; PostgreSQL and MySQL can handle the same query. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
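The behavior the ticket asks for (coerce the string literal to a timestamp, subtract, then extract the day, as PostgreSQL does) can be sketched as follows. This is illustrative only, not Spark's actual coercion rules.

```python
# Sketch of the desired implicit coercion: a string operand in timestamp
# subtraction is cast to timestamp, not to double (illustrative only).
from datetime import datetime

def minus(left, right):
    """Subtract, implicitly casting any string operand to a timestamp."""
    def coerce(v):
        return datetime.fromisoformat(v) if isinstance(v, str) else v
    return coerce(left) - coerce(right)

def extract_day(interval):
    """EXTRACT(DAY FROM interval): the whole days of the interval."""
    return interval.days

# NOW() is fixed here so the example is deterministic.
delta = minus("2014-08-05 10:00:00", "2014-08-02 08:10:56")
```

With this coercion the query in the report resolves cleanly instead of failing with a `timestamp and double` mismatch.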
[jira] [Commented] (SPARK-20110) Windowed aggregation do not work when the timestamp is a nested field
[ https://issues.apache.org/jira/browse/SPARK-20110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974177#comment-16974177 ] hurelhuyag commented on SPARK-20110: I just faced the same problem now, on Spark version 2.4.4. I don't understand the difference: the two queries do the same thing. If the first one is wrong, then the second one should be wrong too. > Windowed aggregation do not work when the timestamp is a nested field > - > > Key: SPARK-20110 > URL: https://issues.apache.org/jira/browse/SPARK-20110 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Alexis Seigneurin >Priority: Major > Labels: bulk-closed > > I am loading data into a DataFrame with nested fields. I want to perform a > windowed aggregation on the timestamp from a nested field: > {code} > .groupBy(window($"auth.sysEntryTimestamp", "2 minutes")) > {code} > I get the following error: > {quote} > org.apache.spark.sql.AnalysisException: Multiple time window expressions > would result in a cartesian product of rows, therefore they are currently > not supported. > {quote} > This works fine if I first extract the timestamp to a separate column: > {code} > .withColumn("sysEntryTimestamp", $"auth.sysEntryTimestamp") > .groupBy( > window($"sysEntryTimestamp", "2 minutes") > ) > {code} > Please see the whole sample: > - batch: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363210/3769253384867782/latest.html > - Structured Streaming: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363192/3769253384867782/latest.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20110) Windowed aggregation do not work when the timestamp is a nested field
[ https://issues.apache.org/jira/browse/SPARK-20110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974177#comment-16974177 ] hurelhuyag edited comment on SPARK-20110 at 11/14/19 12:09 PM: --- I just faced the same problem, on Spark version 2.4.4. I don't understand the difference: the two queries do the same thing. If the first one is wrong, then the second one should be wrong too. was (Author: hurelhuyag): I just faced the same problem now, on Spark version 2.4.4. I don't understand the difference: the two queries do the same thing. If the first one is wrong, then the second one should be wrong too. > Windowed aggregation do not work when the timestamp is a nested field > - > > Key: SPARK-20110 > URL: https://issues.apache.org/jira/browse/SPARK-20110 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Alexis Seigneurin >Priority: Major > Labels: bulk-closed > > I am loading data into a DataFrame with nested fields. I want to perform a > windowed aggregation on the timestamp from a nested field: > {code} > .groupBy(window($"auth.sysEntryTimestamp", "2 minutes")) > {code} > I get the following error: > {quote} > org.apache.spark.sql.AnalysisException: Multiple time window expressions > would result in a cartesian product of rows, therefore they are currently > not supported.
> {quote} > This works fine if I first extract the timestamp to a separate column: > {code} > .withColumn("sysEntryTimestamp", $"auth.sysEntryTimestamp") > .groupBy( > window($"sysEntryTimestamp", "2 minutes") > ) > {code} > Please see the whole sample: > - batch: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363210/3769253384867782/latest.html > - Structured Streaming: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363192/3769253384867782/latest.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
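The workaround in the report pulls the nested timestamp into a top-level column before windowing. A minimal pure-Python sketch of the same tumbling-window grouping (illustrative only; this is not Spark's `window` implementation, and the row shapes are made up):

```python
# Sketch of window($"ts", "2 minutes") applied after extracting the nested
# timestamp, mirroring the workaround above (illustrative only).
from collections import defaultdict
from datetime import datetime

def window_start(ts, minutes=2):
    """Floor a timestamp to the start of its tumbling window.
    Assumes the window size divides 60 minutes evenly."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

rows = [
    {"auth": {"sysEntryTimestamp": datetime(2019, 11, 14, 12, 0, 30)}},
    {"auth": {"sysEntryTimestamp": datetime(2019, 11, 14, 12, 1, 10)}},
    {"auth": {"sysEntryTimestamp": datetime(2019, 11, 14, 12, 2, 5)}},
]

counts = defaultdict(int)
for row in rows:
    # Equivalent of .withColumn("sysEntryTimestamp", $"auth.sysEntryTimestamp")
    ts = row["auth"]["sysEntryTimestamp"]
    counts[window_start(ts)] += 1
```

The first two rows land in the 12:00 window and the third in the 12:02 window, which is the aggregation both Spark queries are trying to express.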
[jira] [Created] (SPARK-29896) Extend typed literals support for all spark native types
Kent Yao created SPARK-29896: Summary: Extend typed literals support for all spark native types Key: SPARK-29896 URL: https://issues.apache.org/jira/browse/SPARK-29896 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Currently, Date, Timestamp, Interval, Binary, and INTEGER typed literals are supported. We should support other native datatypes for this feature.
{code:sql}
-- typed literals
-- boolean
select boolean 'true';
select boolean 'false';
select boolean 't';
select boolean 'f';
select boolean 'yes';
select boolean 'no';
select -boolean 'true';

-- byte
select tinyint '1';
select tinyint '-1';
select tinyint '128';
select byte '1';
select -tinyint '1';

-- short
select smallint '1';
select smallint '-1';
select smallint '32768';
select short '1';
select -smallint '1';

-- long
select long '1';
select bigint '-1';
select -bigint '1';

-- float/double
select float '1';
select -float '-1';
select double '1';
select -double '1';

-- hive string type
select char(10) '12345';
select varchar(10) '12345';

-- binary
select binary '12345';

-- decimal
select decimal '1.01';
select decimal(10, 2) '11.1';
select decimal(2, 0) '11.1';
select decimal(2, 1) '11.1';
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
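A toy parser shows the kind of range and format checks such typed literals imply. This is only a sketch: it is not Spark's literal parsing, and the accepted boolean spellings are assumptions based on the SQL file above.

```python
# Toy parser for typed literals like tinyint '1' or boolean 'yes'
# (illustrative only; not Spark's actual literal parsing).
def parse_typed_literal(type_name, text):
    t = type_name.lower()
    if t == "boolean":
        if text.lower() in {"true", "t", "yes"}:
            return True
        if text.lower() in {"false", "f", "no"}:
            return False
        raise ValueError(f"invalid boolean literal: {text!r}")
    if t in ("tinyint", "byte"):
        v = int(text)
        if not -128 <= v <= 127:  # tinyint '128' must fail
            raise ValueError(f"tinyint out of range: {v}")
        return v
    if t in ("smallint", "short"):
        v = int(text)
        if not -32768 <= v <= 32767:  # smallint '32768' must fail
            raise ValueError(f"smallint out of range: {v}")
        return v
    if t in ("float", "double"):
        return float(text)
    raise ValueError(f"unsupported typed literal: {type_name}")
```

Literals such as `tinyint '128'` and `smallint '32768'` in the ticket's test file are there precisely to exercise these out-of-range errors.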
[jira] [Commented] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks
[ https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974162#comment-16974162 ] Prakhar Jain commented on SPARK-21040: -- Hi [~holden], At Microsoft, we are also facing the same issues while adding support for low-priority VMs, and we are working along similar lines. We have considered the following options: Option 1) Whenever an executor goes into the decommissioning state, consider all the tasks that are running on that executor for speculation (without worrying about "spark.speculation.quantile" or "spark.speculation.multiplier"). Option 2) Whenever an executor goes into the decommissioning state, check the following for each task running on that executor - Check if X% of tasks have finished in the corresponding stage and identify the median time - if (MedianTime - RunTimeOfTaskInConsideration) > cloud_threshold then consider the task for speculation. cloud_threshold can be set as a configuration parameter (e.g. 120 seconds for AWS spot instances). What are your thoughts on the same? > On executor/worker decommission consider speculatively re-launching current > tasks > - > > Key: SPARK-21040 > URL: https://issues.apache.org/jira/browse/SPARK-21040 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > > If speculative execution is enabled we may wish to consider decommissioning > of a worker as a weight for speculative execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
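Option 2 from the comment above can be sketched as a small decision function. The function name, parameters, and defaults (`quantile=0.75`, `cloud_threshold=120`) are hypothetical choices for illustration, not anything merged into Spark.

```python
# Sketch of Option 2: on decommission, speculate a running task only if the
# median runtime of finished tasks leaves it enough headroom versus a
# configurable cloud threshold. All names/defaults are hypothetical.
import statistics

def should_speculate(finished_runtimes, task_runtime, total_tasks,
                     quantile=0.75, cloud_threshold=120.0):
    """Decide whether a task on a decommissioning executor should be
    speculatively re-launched elsewhere (all times in seconds)."""
    # Require X% of the stage's tasks to have finished first.
    if len(finished_runtimes) < quantile * total_tasks:
        return False
    median = statistics.median(finished_runtimes)
    # The comment's rule: (MedianTime - RunTimeOfTaskInConsideration)
    # must exceed cloud_threshold for speculation to pay off.
    return (median - task_runtime) > cloud_threshold
```

A task that has already run close to the median is left alone, since relaunching it would likely cost more than letting the decommissioning executor finish it.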
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974143#comment-16974143 ] huangtianhua commented on SPARK-29106: -- [~shaneknapp], the VM is ready; I have built and tested in /home/jenkins/spark. Because the image of the old ARM testing instance is too large, we can't create the new instance from that image, so we copied the contents of /home/jenkins/ into the new instance. Also, because of the network performance, we cache local sources such as "hive-ivy" in /home/jenkins/hive-ivy-cache; please export the environment variable {color:#de350b}SPARK_VERSIONS_SUITE_IVY_PATH=/home/jenkins/hive-ivy-cache/{color} before running the Maven tests. I will send the detailed info of the VM to your email later. Please add it as a worker of the amplab Jenkins and try to build the two jobs as we did before; don't hesitate to contact us if you have any questions, thanks very much. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt > > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic ARM test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), > and the other is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the ARM test with amplab > jenkins.
> About the k8s test on ARM, we have tested it, see > [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it > later. > We also plan to test other stable branches, and we can integrate them into > amplab when they are ready. > We have offered an ARM instance and sent the info to shane knapp; thanks > shane for adding the first ARM job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on the arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases a new arm-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it when running the ARM tests for spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29894) Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab
[ https://issues.apache.org/jira/browse/SPARK-29894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-29894: Attachment: snippet_plan_graph_before_patch.png > Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab > --- > > Key: SPARK-29894 > URL: https://issues.apache.org/jira/browse/SPARK-29894 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png, > snippet_plan_graph_before_patch.png > > > The Web UI SQL Tab provides information on the executed SQL using plan graphs > and SQL execution plans. Both provide useful information. Physical execution > plans report the Codegen Stage Id. It is useful to have Codegen Stage Id also > reported in the plan graphs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29894) Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab
[ https://issues.apache.org/jira/browse/SPARK-29894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-29894: Attachment: snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png > Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab > --- > > Key: SPARK-29894 > URL: https://issues.apache.org/jira/browse/SPARK-29894 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png > > > The Web UI SQL Tab provides information on the executed SQL using plan graphs > and SQL execution plans. Both provide useful information. Physical execution > plans report the Codegen Stage Id. It is useful to have Codegen Stage Id also > reported in the plan graphs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29894) Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab
Luca Canali created SPARK-29894: --- Summary: Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab Key: SPARK-29894 URL: https://issues.apache.org/jira/browse/SPARK-29894 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.0.0 Reporter: Luca Canali The Web UI SQL Tab provides information on the executed SQL using plan graphs and SQL execution plans. Both provide useful information. Physical execution plans report the Codegen Stage Id. It is useful to have Codegen Stage Id also reported in the plan graphs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29893) Improve the local reader performance by changing the task number from 1 to multi
Ke Jia created SPARK-29893: -- Summary: Improve the local reader performance by changing the task number from 1 to multi Key: SPARK-29893 URL: https://issues.apache.org/jira/browse/SPARK-29893 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ke Jia Currently the local reader reads all the partitions of the map stage using only 1 task, which may cause performance degradation. This PR will improve performance by using multiple tasks instead of one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29892) Add built-in Array Functions: array_cat
[ https://issues.apache.org/jira/browse/SPARK-29892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-29892: --- Description: |{{array_cat}}{{(}}{{anyarray}}{{, }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| Other DBs: [https://phoenix.apache.org/language/functions.html#array_cat] was:|{{array_cat}}{{(}}{{anyarray}}{{, }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| > Add built-in Array Functions: array_cat > --- > > Key: SPARK-29892 > URL: https://issues.apache.org/jira/browse/SPARK-29892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_cat}}{{(}}{{anyarray}}{{, > }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two > arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| > Other DBs: > [https://phoenix.apache.org/language/functions.html#array_cat] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29892) Add built-in Array Functions: array_cat
[ https://issues.apache.org/jira/browse/SPARK-29892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974089#comment-16974089 ] jiaan.geng commented on SPARK-29892: I'm working on this. > Add built-in Array Functions: array_cat > --- > > Key: SPARK-29892 > URL: https://issues.apache.org/jira/browse/SPARK-29892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_cat}}{{(}}{{anyarray}}{{, > }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two > arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29892) Add built-in Array Functions: array_cat
jiaan.geng created SPARK-29892: -- Summary: Add built-in Array Functions: array_cat Key: SPARK-29892 URL: https://issues.apache.org/jira/browse/SPARK-29892 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: jiaan.geng |{{array_cat}}{{(}}{{anyarray}}{{, }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
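The semantics requested above amount to plain array concatenation, as in the PostgreSQL/Phoenix `array_cat` the ticket references. A minimal plain-Python sketch of those semantics (not the proposed Spark implementation; note that Spark already exposes equivalent behavior for array columns through its built-in `concat` function, so the ticket is effectively about the PostgreSQL-compatible name):

```python
# Plain-Python sketch of array_cat semantics: concatenate two arrays
# into one, preserving element order.
def array_cat(a, b):
    return a + b

print(array_cat([1, 2, 3], [4, 5]))  # [1, 2, 3, 4, 5]
```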
[jira] [Comment Edited] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974087#comment-16974087 ] sandeshyapuram edited comment on SPARK-29890 at 11/14/19 9:36 AM: -- I've raised it as a bug because I feel fill.na(0) needs to fill 0 regardless of duplicate column names. [~cloud_fan] Thoughts was (Author: sandeshyapuram): I've raised it as a bug because I feel fill.na(0) needs to fill 0 regardless of duplicate column names. > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974087#comment-16974087 ] sandeshyapuram commented on SPARK-29890: I've raised it as a bug because I feel fill.na(0) needs to fill 0 regardless of duplicate column names. > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29891) Add built-in Array Functions: array_length
[ https://issues.apache.org/jira/browse/SPARK-29891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974086#comment-16974086 ] jiaan.geng commented on SPARK-29891: I'm working on this. > Add built-in Array Functions: array_length > -- > > Key: SPARK-29891 > URL: https://issues.apache.org/jira/browse/SPARK-29891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_length}}{{(}}{{anyarray}}{{, }}{{int}}{{)}}|{{int}}|returns the > length of the requested array dimension|{{array_length(array[1,2,3], > 1)}}|{{3}}| > | | | | | | > Other DBs: > [https://phoenix.apache.org/language/functions.html#array_length] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29891) Add built-in Array Functions: array_length
[ https://issues.apache.org/jira/browse/SPARK-29891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-29891: --- Description: |{{array_length}}{{(}}{{anyarray}}{{, }}{{int}}{{)}}|{{int}}|returns the length of the requested array dimension|{{array_length(array[1,2,3], 1)}}|{{3}}| | | | | | | Other DBs: [https://phoenix.apache.org/language/functions.html#array_length] was: |{{array_length}}{{(}}{{anyarray}}{{, }}{{int}}{{)}}|{{int}}|returns the length of the requested array dimension|{{array_length(array[1,2,3], 1)}}|{{3}}| | | | | | | > Add built-in Array Functions: array_length > -- > > Key: SPARK-29891 > URL: https://issues.apache.org/jira/browse/SPARK-29891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_length}}{{(}}{{anyarray}}{{, }}{{int}}{{)}}|{{int}}|returns the > length of the requested array dimension|{{array_length(array[1,2,3], > 1)}}|{{3}}| > | | | | | | > Other DBs: > [https://phoenix.apache.org/language/functions.html#array_length] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
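The `array_length(anyarray, int)` signature above returns the number of elements along the requested dimension. A plain-Python sketch of those semantics, modeling a multidimensional array as nested lists (illustrative only, not the proposed Spark implementation):

```python
# Plain-Python sketch of array_length semantics: size of the requested
# dimension, where dimension 1 is the outermost level of nesting.
def array_length(arr, dim):
    for _ in range(dim - 1):
        arr = arr[0]  # descend one nesting level per extra dimension
    return len(arr)

print(array_length([1, 2, 3], 1))         # 3, as in the ticket's example
print(array_length([[1, 2], [3, 4]], 2))  # 2
```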