[jira] [Resolved] (SPARK-29655) Enable adaptive execution should not add more ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-29655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29655. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26409 [https://github.com/apache/spark/pull/26409] > Enable adaptive execution should not add more ShuffleExchange > - > > Key: SPARK-29655 > URL: https://issues.apache.org/jira/browse/SPARK-29655 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Enable adaptive execution should not add more ShuffleExchange. How to > reproduce: > {code:scala} > import org.apache.spark.sql.SaveMode > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > spark.conf.set("spark.sql.shuffle.partitions", 4) > val bucketedTableName = "bucketed_table" > spark.range(10).write.bucketBy(4, > "id").sortBy("id").mode(SaveMode.Overwrite).saveAsTable(bucketedTableName) > val bucketedTable = spark.table(bucketedTableName) > val df = spark.range(4) > df.join(bucketedTable, "id").explain() > spark.conf.set("spark.sql.adaptive.enabled", true) > spark.conf.set("spark.sql.adaptive.shuffle.maxNumPostShufflePartitions", 5) > df.join(bucketedTable, "id").explain() > {code} > Output: > {noformat} > == Physical Plan == > AdaptiveSparkPlan(isFinalPlan=false) > +- Project [id#5L] >+- SortMergeJoin [id#5L], [id#3L], Inner > :- Sort [id#5L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(id#5L, 5), true, [id=#92] > : +- Range (0, 4, step=1, splits=16) > +- Sort [id#3L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#3L, 5), true, [id=#93] > +- Project [id#3L] >+- Filter isnotnull(id#3L) > +- FileScan parquet default.bucketed_table[id#3L] Batched: > true, DataFilters: [isnotnull(id#3L)], Format: Parquet, Location: > InMemoryFileIndex[file:/root/spark-3.0.0-preview-bin-hadoop3.2/spark-warehouse/bucketed_table], > PartitionFilters: [], 
PushedFilters: [IsNotNull(id)], ReadSchema: > struct, SelectedBucketsCount: 4 out of 4 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
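The extra Exchange above can be understood with a toy model (illustrative only, not Spark's actual planner code): a sort-merge join requires both sides hash-partitioned on the join key with the same number of partitions, so when AQE targets a partition count (5) different from the table's bucket count (4), even the bucketed side gets re-shuffled.

```python
def needs_exchange(child_partitioning, required_num_partitions):
    """Toy model of the planning decision: return True if an Exchange must
    be inserted above the child. `child_partitioning` is a
    (key, num_partitions) tuple, or None for an unpartitioned child."""
    if child_partitioning is None:
        return True  # unpartitioned input always needs a shuffle
    _, num = child_partitioning
    # A mismatch between the child's partition count and the planner's
    # target count forces a re-shuffle of an already-partitioned side.
    return num != required_num_partitions

# Without AQE: target 4 (spark.sql.shuffle.partitions) matches the 4 buckets,
# so only the unbucketed side is shuffled.
assert needs_exchange(("id", 4), 4) is False
assert needs_exchange(None, 4) is True

# With AQE targeting 5 post-shuffle partitions, the 4-bucket side now
# mismatches and gains an extra Exchange -- the behavior reported above.
assert needs_exchange(("id", 4), 5) is True
```

The fix in the linked PR avoids this by not letting the adaptive target partition count invalidate an already-satisfying child distribution.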
[jira] [Assigned] (SPARK-29655) Enable adaptive execution should not add more ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-29655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29655: --- Assignee: Yuming Wang > Enable adaptive execution should not add more ShuffleExchange > - > > Key: SPARK-29655 > URL: https://issues.apache.org/jira/browse/SPARK-29655 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Enable adaptive execution should not add more ShuffleExchange. How to > reproduce: > {code:scala} > import org.apache.spark.sql.SaveMode > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) > spark.conf.set("spark.sql.shuffle.partitions", 4) > val bucketedTableName = "bucketed_table" > spark.range(10).write.bucketBy(4, > "id").sortBy("id").mode(SaveMode.Overwrite).saveAsTable(bucketedTableName) > val bucketedTable = spark.table(bucketedTableName) > val df = spark.range(4) > df.join(bucketedTable, "id").explain() > spark.conf.set("spark.sql.adaptive.enabled", true) > spark.conf.set("spark.sql.adaptive.shuffle.maxNumPostShufflePartitions", 5) > df.join(bucketedTable, "id").explain() > {code} > Output: > {noformat} > == Physical Plan == > AdaptiveSparkPlan(isFinalPlan=false) > +- Project [id#5L] >+- SortMergeJoin [id#5L], [id#3L], Inner > :- Sort [id#5L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(id#5L, 5), true, [id=#92] > : +- Range (0, 4, step=1, splits=16) > +- Sort [id#3L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#3L, 5), true, [id=#93] > +- Project [id#3L] >+- Filter isnotnull(id#3L) > +- FileScan parquet default.bucketed_table[id#3L] Batched: > true, DataFilters: [isnotnull(id#3L)], Format: Parquet, Location: > InMemoryFileIndex[file:/root/spark-3.0.0-preview-bin-hadoop3.2/spark-warehouse/bucketed_table], > PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: > struct, SelectedBucketsCount: 4 out of 4 > {noformat} -- This message was 
sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29911) Cache table may memory leak when session closed
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29911: --- Summary: Cache table may memory leak when session closed (was: Cache table may memory leak when session stopped) > Cache table may memory leak when session closed > --- > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. > {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} > !Screen Shot 2019-11-15 at 2.03.49 PM.png! 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29911) Cache table may memory leak when session stopped
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29911: --- Description: How to reproduce: 1. create a local temporary view v1 2. cache it in memory 3. close session without drop v1. The application will hold the memory forever. In a long running thrift server scenario. It's worse. {code} 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; CACHE TABLE testCacheTable AS SELECT 1; +-+--+ | Result | +-+--+ +-+--+ No rows selected (1.498 seconds) 0: jdbc:hive2://localhost:1> !close !close Closing: 0: jdbc:hive2://localhost:1 0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1' !connect 'jdbc:hive2://localhost:1' Connecting to jdbc:hive2://localhost:1 Enter username for jdbc:hive2://localhost:1: lajin Enter password for jdbc:hive2://localhost:1: *** Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 1: jdbc:hive2://localhost:1> select * from testCacheTable; select * from testCacheTable; Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [testCacheTable] (state=,code=0) {code} !Screen Shot 2019-11-15 at 2.03.49 PM.png! was: How to reproduce: 1. create a local temporary view v1 2. cache it in memory 3. close session without drop v1. The application will hold the memory forever. In a long running thrift server scenario. It's worse. 
{code} 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; CACHE TABLE testCacheTable AS SELECT 1; +-+--+ | Result | +-+--+ +-+--+ No rows selected (1.498 seconds) 0: jdbc:hive2://localhost:1> !close !close Closing: 0: jdbc:hive2://localhost:1 0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1' !connect 'jdbc:hive2://localhost:1' Connecting to jdbc:hive2://localhost:1 Enter username for jdbc:hive2://localhost:1: lajin Enter password for jdbc:hive2://localhost:1: *** Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 1: jdbc:hive2://localhost:1> select * from testCacheTable; select * from testCacheTable; Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [testCacheTable] (state=,code=0) {code} > Cache table may memory leak when session stopped > > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. 
> {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} > !Screen Shot 2019-11-15 at 2.03.49 PM.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29911) Cache table may memory leak when session stopped
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29911: --- Attachment: Screen Shot 2019-11-15 at 2.03.49 PM.png > Cache table may memory leak when session stopped > > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. > {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29911) Cache table may memory leak when session stopped
[ https://issues.apache.org/jira/browse/SPARK-29911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29911: --- Description: How to reproduce: 1. create a local temporary view v1 2. cache it in memory 3. close session without drop v1. The application will hold the memory forever. In a long running thrift server scenario. It's worse. {code} 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; CACHE TABLE testCacheTable AS SELECT 1; +-+--+ | Result | +-+--+ +-+--+ No rows selected (1.498 seconds) 0: jdbc:hive2://localhost:1> !close !close Closing: 0: jdbc:hive2://localhost:1 0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1' !connect 'jdbc:hive2://localhost:1' Connecting to jdbc:hive2://localhost:1 Enter username for jdbc:hive2://localhost:1: lajin Enter password for jdbc:hive2://localhost:1: *** Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 1: jdbc:hive2://localhost:1> select * from testCacheTable; select * from testCacheTable; Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [testCacheTable] (state=,code=0) {code} was: How to reproduce: 1. create a local temporary view v1 2. cache it in memory 3. close session without drop v1. The application will hold the memory forever. In a long running thrift server scenario. It's worse. 
{code} 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; CACHE TABLE testCacheTable AS SELECT 1; +-+--+ | Result | +-+--+ +-+--+ No rows selected (1.498 seconds) 0: jdbc:hive2://localhost:1> !close !close Closing: 0: jdbc:hive2://localhost:1 0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1' !connect 'jdbc:hive2://localhost:1' Connecting to jdbc:hive2://localhost:1 Enter username for jdbc:hive2://localhost:1: lajin lajin Enter password for jdbc:hive2://localhost:1: 123 *** Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 1: jdbc:hive2://localhost:1> select * from testCacheTable; select * from testCacheTable; Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [testCacheTable] (state=,code=0) {code} > Cache table may memory leak when session stopped > > > Key: SPARK-29911 > URL: https://issues.apache.org/jira/browse/SPARK-29911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png > > > How to reproduce: > 1. create a local temporary view v1 > 2. cache it in memory > 3. close session without drop v1. > The application will hold the memory forever. In a long running thrift server > scenario. It's worse. 
> {code} > 0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1; > CACHE TABLE testCacheTable AS SELECT 1; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (1.498 seconds) > 0: jdbc:hive2://localhost:1> !close > !close > Closing: 0: jdbc:hive2://localhost:1 > 0: jdbc:hive2://localhost:1 (closed)> !connect > 'jdbc:hive2://localhost:1' > !connect 'jdbc:hive2://localhost:1' > Connecting to jdbc:hive2://localhost:1 > Enter username for jdbc:hive2://localhost:1: > lajin > Enter password for jdbc:hive2://localhost:1: > *** > Connected to: Spark SQL (version 3.0.0-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 1: jdbc:hive2://localhost:1> select * from testCacheTable; > select * from testCacheTable; > Error: Error running query: org.apache.spark.sql.AnalysisException: Table or > view not found: testCacheTable; line 1 pos 14; > 'Project [*] > +- 'UnresolvedRelation [testCacheTable] (state=,code=0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29911) Cache table may memory leak when session stopped
Lantao Jin created SPARK-29911: -- Summary: Cache table may memory leak when session stopped Key: SPARK-29911 URL: https://issues.apache.org/jira/browse/SPARK-29911 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin Attachments: Screen Shot 2019-11-15 at 2.03.49 PM.png
How to reproduce:
1. create a local temporary view v1
2. cache it in memory
3. close the session without dropping v1.
The application will hold the memory forever; in a long-running Thrift server scenario it's worse.
{code}
0: jdbc:hive2://localhost:1> CACHE TABLE testCacheTable AS SELECT 1;
CACHE TABLE testCacheTable AS SELECT 1;
+-+--+
| Result |
+-+--+
+-+--+
No rows selected (1.498 seconds)
0: jdbc:hive2://localhost:1> !close
!close
Closing: 0: jdbc:hive2://localhost:1
0: jdbc:hive2://localhost:1 (closed)> !connect 'jdbc:hive2://localhost:1'
!connect 'jdbc:hive2://localhost:1'
Connecting to jdbc:hive2://localhost:1
Enter username for jdbc:hive2://localhost:1: lajin
lajin
Enter password for jdbc:hive2://localhost:1: 123
***
Connected to: Spark SQL (version 3.0.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://localhost:1> select * from testCacheTable;
select * from testCacheTable;
Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [testCacheTable] (state=,code=0)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
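The leak described above comes down to cached data that is not keyed to (or released with) the session that created it. A minimal sketch of the fix direction, with hypothetical names (this is not Spark's actual CacheManager API):

```python
class SessionCacheRegistry:
    """Illustrative registry that tracks cached relations per session so
    they can be released when the session closes."""

    def __init__(self):
        self._cache = {}  # (session_id, table_name) -> cached data

    def cache_table(self, session_id, name, data):
        self._cache[(session_id, name)] = data

    def lookup(self, session_id, name):
        return self._cache.get((session_id, name))

    def close_session(self, session_id):
        # Without this cleanup step, entries outlive the session -- the
        # leak reported above, which is worst in a long-running Thrift
        # server that opens and closes many sessions.
        for key in [k for k in self._cache if k[0] == session_id]:
            del self._cache[key]

reg = SessionCacheRegistry()
reg.cache_table("session-0", "testCacheTable", [1])
reg.close_session("session-0")
assert reg.lookup("session-0", "testCacheTable") is None
```

Note the symptom in the report: after `!close`, the table is no longer *resolvable* (AnalysisException), yet the cached blocks still occupy memory, as the attached heap screenshot shows.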
[jira] [Created] (SPARK-29910) Add minimum runtime limit to speculation
Deegue created SPARK-29910: -- Summary: Add minimum runtime limit to speculation Key: SPARK-29910 URL: https://issues.apache.org/jira/browse/SPARK-29910 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Deegue The minimum runtime for speculation used to be a fixed value of 100 ms, which means tasks that finish within seconds may also be speculated, requiring more executors. To resolve this, we add `spark.speculation.minRuntime` to control the minimum runtime limit for speculation; adjusting `spark.speculation.minRuntime` reduces how many normal tasks get speculated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
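The proposed check can be sketched as follows (a toy model, not Spark's TaskSetManager code; parameter names are illustrative): a task becomes a speculation candidate only if it is both slower than a multiple of the median runtime *and* has already run at least the configured minimum.

```python
def should_speculate(runtime_ms, median_ms, multiplier=1.5, min_runtime_ms=100):
    """Toy sketch of speculative-execution eligibility with a configurable
    minimum-runtime floor (standing in for spark.speculation.minRuntime)."""
    return runtime_ms >= min_runtime_ms and runtime_ms > multiplier * median_ms

# With the old fixed 100 ms floor, a 300 ms task past 1.5x the median
# is speculated even though relaunching it buys almost nothing:
assert should_speculate(300, 150) is True

# Raising the floor (here to 1 s) filters such short tasks out:
assert should_speculate(300, 150, min_runtime_ms=1000) is False
```

The design point is that the floor only suppresses speculation of short tasks; genuinely slow stragglers still exceed both thresholds.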
[jira] [Commented] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974836#comment-16974836 ] koert kuipers commented on SPARK-29906: --- I added a bit of debug logging:
{code:java}
$ git diff
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
index 375cec5971..7e5b7fb235 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
@@ -86,7 +86,7 @@ object CSVDataSource extends Logging {
   }
 }
 
-object TextInputCSVDataSource extends CSVDataSource {
+object TextInputCSVDataSource extends CSVDataSource with Logging {
   override val isSplitable: Boolean = true
 
   override def readFile(
@@ -110,9 +110,13 @@ object TextInputCSVDataSource extends CSVDataSource {
       sparkSession: SparkSession,
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): StructType = {
+    logInfo(s"!! inputPaths ${inputPaths}")
     val csv = createBaseDataset(sparkSession, inputPaths, parsedOptions)
     val maybeFirstLine = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).take(1).headOption
-    inferFromDataset(sparkSession, csv, maybeFirstLine, parsedOptions)
+    logInfo(s"!! maybeFirstLine ${maybeFirstLine}")
+    val schema = inferFromDataset(sparkSession, csv, maybeFirstLine, parsedOptions)
+    logInfo(s"!! schema ${schema}")
+    schema
   }
{code}
and this shows when spark.sql.adaptive.enabled=true:
{code:java}
19/11/15 05:52:06 INFO csv.TextInputCSVDataSource: !!
inputPaths List(LocatedFileStatus{path=hdfs://ip-xx-xxx-x-xxx.ec2.internal:8020/user/hadoop/OP_DTL_GNRL_PGYR2013_P06282019.csv; isDirectory=false; length=2242114396; replication=3; blocksize=134217728; modification_time=1573794115499; access_time=1573794109887; owner=hadoop; group=hadoop; permission=rw-r--r--; isSymlink=false}) 19/11/15 05:52:10 INFO csv.TextInputCSVDataSource: !! maybeFirstLine Some("UNCHANGED","Covered Recipient Physician""195068","SCOTT","KEVIN","FORMAN",,"360 SAN MIGUEL DR","SUITE 701","NEWPORT BEACH","CA","92660-7853","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Orthopaedic Surgery","CA","Wright Medical Technology, Inc.","10011065","Wright Medical Technology, Inc.","TN","United States",12.50,"08/20/2013","1","In-kind items and services","Food and Beverage""No","No Third Party Payment",,"No",,,"No","105165962","No","Covered","Foot and Ankle",,,"2013","06/28/2019") 19/11/15 05:52:10 INFO csv.TextInputCSVDataSource: !! schema StructType(StructField(UNCHANGED,StringType,true), StructField(Covered Recipient Physician,StringType,true), StructField(_c2,StringType,true), StructField(_c3,StringType,true), StructField(_c4,StringType,true), StructField(195068,StringType,true), StructField(SCOTT,StringType,true), StructField(KEVIN,StringType,true), StructField(FORMAN,StringType,true), StructField(_c9,StringType,true), StructField(360 SAN MIGUEL DR,StringType,true), StructField(SUITE 701,StringType,true), StructField(NEWPORT BEACH,StringType,true), StructField(CA13,StringType,true), StructField(92660-7853,StringType,true), StructField(United States15,StringType,true), StructField(_c16,StringType,true), StructField(_c17,StringType,true), StructField(Medical Doctor,StringType,true), StructField(Allopathic & Osteopathic Physicians|Orthopaedic Surgery,StringType,true), StructField(CA20,StringType,true), StructField(_c21,StringType,true), StructField(_c22,StringType,true), StructField(_c23,StringType,true), 
StructField(_c24,StringType,true), StructField(Wright Medical Technology, Inc.25,StringType,true), StructField(10011065,StringType,true), StructField(Wright Medical Technology, Inc.27,StringType,true), StructField(TN,StringType,true), StructField(United States29,StringType,true), StructField(12.50,StringType,true), StructField(08/20/2013,StringType,true), StructField(1,StringType,true), StructField(In-kind items and services,StringType,true), StructField(Food and Beverage,StringType,true), StructField(_c35,StringType,true), StructField(_c36,StringType,true), StructField(_c37,StringType,true), StructField(No38,StringType,true), StructField(No Third Party Payment,StringType,true), StructField(_c40,StringType,true), StructField(No41,StringType,true), StructField(_c42,StringType,true), StructField(_c43,StringType,true), StructField(No44,StringType,true), StructField(105165962,StringType,true), StructField(No46,StringType,true), StructField(Covered,StringType,true), StructField(Foot and Ankle,StringType,true), StructField(_c49,StringType,true), StructField(_c50,StringType,true),
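The inferred StructType above is the telltale symptom: with adaptive execution on, `maybeFirstLine` is a *data* row, so its values ("195068", "SCOTT", …) become column names and blank fields become `_cNN` placeholders. A minimal illustration of that failure mode (toy code, not Spark's inference logic):

```python
import csv
import io

def infer_columns(text, header_found=True):
    """Toy schema inference: normally the first line supplies the column
    names; if the header line is lost (as appears to happen here with AQE
    enabled), the first data row's values are mistaken for the schema."""
    rows = list(csv.reader(io.StringIO(text)))
    candidate = rows[0] if header_found else rows[1]
    # Empty fields get positional placeholder names, like Spark's _cNN.
    return [v if v else f"_c{i}" for i, v in enumerate(candidate)]

data = 'change_type,name\n"UNCHANGED","SCOTT"\n'
assert infer_columns(data) == ["change_type", "name"]
# Header lost: data values leak into the "schema", as in the log above.
assert infer_columns(data, header_found=False) == ["UNCHANGED", "SCOTT"]
```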
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974824#comment-16974824 ] Terry Kim commented on SPARK-29900: --- Cool. I will compile the list and send it out to dev/user list. Thanks! > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In Postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 row) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974822#comment-16974822 ] Wenchen Fan commented on SPARK-29900: - Yea exactly! I don't think it's a big breaking change. We only break the cases when there are temp view and table with the same name, and users can use a qualified name to disambiguate. To move this forward, we need to: 1. find all the places that need to change the table resolution behavior (e.g. saveAsTable, DROP TABLE) 2. propose it to dev/user list 3. implement it > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In Postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 row) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
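The two lookup behaviors described in this issue can be modeled in a few lines (a toy resolver, not Spark's Analyzer; behavior 1 is SELECT/INSERT/DESC TABLE, behavior 2 is most other commands):

```python
def resolve_relation(name, temp_views, catalog, temp_first=True):
    """Toy model of Spark's two relation-resolution behaviors.
    temp_first=True: try temp views, then tables/persistent views (behavior 1).
    temp_first=False: only tables/persistent views (behavior 2)."""
    if temp_first and name in temp_views:
        return ("temp_view", temp_views[name])
    if name in catalog:
        return ("table", catalog[name])
    raise LookupError(f"Table or view not found: {name}")

# With a temp view and a table sharing the name 't' (as in the Postgres
# session quoted above), the two behaviors resolve differently:
temp_views = {"t": "select 2 as i"}
catalog = {"t": "s1.t"}
assert resolve_relation("t", temp_views, catalog)[0] == "temp_view"
assert resolve_relation("t", temp_views, catalog, temp_first=False)[0] == "table"
```

The inconsistency the issue proposes to remove is exactly this flag: the same name resolves to different relations depending on which command is running.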
[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974821#comment-16974821 ] Joachim Hereth commented on SPARK-29748: [~bryanc] By simply removing sorting we change the semantics, e.g. `Row(a=1, b=2) != Row(b=2, a=1)` (as opposed to what we currently have). Also, there might be problems if data was written with Spark pre-change and read after the change. Adding workarounds (if possible) will make the code very complex. I think [~zero323] was thinking about changes for the upcoming 3.0? > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in Python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc).
If kwargs are used, an error will be raised or > based on a conf setting it can fall back to a LegacyRow that will sort the > fields as before. This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
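The semantic change under discussion can be demonstrated with a minimal Row stand-in (plain `namedtuple`, not PySpark's actual `Row` class): Python 3.6+ preserves kwargs order, so sorting is no longer needed for determinism, but removing it changes the field order and therefore row equality.

```python
from collections import namedtuple

def make_row(sort_fields, **kwargs):
    """Minimal stand-in for PySpark's Row showing the behavior change.
    sort_fields=True mimics the current (sorted) behavior; False mimics
    the proposal to keep kwargs entry order."""
    names = sorted(kwargs) if sort_fields else list(kwargs)
    return namedtuple("Row", names)(**kwargs)

legacy = make_row(True, b=2, a=1)   # current behavior: sorted -> fields (a, b)
new = make_row(False, b=2, a=1)     # proposed: entry order -> fields (b, a)
assert legacy._fields == ("a", "b") and legacy == (1, 2)
assert new._fields == ("b", "a") and new == (2, 1)
```

This is exactly the compatibility concern raised in the comment above: the same kwargs produce differently ordered (hence unequal) rows before and after the change.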
[jira] [Assigned] (SPARK-29888) New interval string parser parse '.111 seconds' to null
[ https://issues.apache.org/jira/browse/SPARK-29888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29888: --- Assignee: Kent Yao > New interval string parser parse '.111 seconds' to null > > > Key: SPARK-29888 > URL: https://issues.apache.org/jira/browse/SPARK-29888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > The current string-to-interval cast logic does not support e.g. cast('.111 > second' as interval), which fails in the SIGN state and returns null; it > should actually be 00:00:00.111. > {code:java} > These are the results of the master branch. > -- !query 63 > select interval '.111 seconds' > -- !query 63 schema > struct<0.111 seconds:interval> > -- !query 63 output > 0.111 seconds > -- !query 64 > select cast('.111 seconds' as interval) > -- !query 64 schema > struct > -- !query 64 output > NULL > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
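For reference, the intended parse of '.111 seconds' can be sketched in plain Python. This is a simplified, hypothetical grammar covering only the seconds unit, not Spark's actual parser:

```python
import re
from datetime import timedelta

def parse_seconds_interval(s):
    # Hypothetical, simplified grammar: [sign][digits][.digits] 'second(s)'.
    m = re.fullmatch(r"\s*([+-]?)(\d*)(?:\.(\d+))?\s+seconds?\s*", s)
    if m is None:
        return None
    sign, whole, frac = m.groups()
    if not whole and not frac:
        return None                      # no digits at all
    # A leading-dot literal like '.111' means a fraction of a second,
    # which is where the reported parser failed.
    value = float((whole or "0") + "." + (frac or "0"))
    return timedelta(seconds=-value if sign == "-" else value)

print(parse_seconds_interval(".111 seconds"))   # 0:00:00.111000
```

The key case is the empty integer part before the dot: the value should be 0.111 seconds rather than a parse failure that surfaces as NULL.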
[jira] [Resolved] (SPARK-29888) New interval string parser parse '.111 seconds' to null
[ https://issues.apache.org/jira/browse/SPARK-29888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29888. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26514 [https://github.com/apache/spark/pull/26514] > New interval string parser parse '.111 seconds' to null > > > Key: SPARK-29888 > URL: https://issues.apache.org/jira/browse/SPARK-29888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > The current string-to-interval cast logic does not support e.g. cast('.111 > second' as interval), which fails in the SIGN state and returns null; it > should actually be 00:00:00.111. > {code:java} > These are the results of the master branch. > -- !query 63 > select interval '.111 seconds' > -- !query 63 schema > struct<0.111 seconds:interval> > -- !query 63 output > 0.111 seconds > -- !query 64 > select cast('.111 seconds' as interval) > -- !query 64 schema > struct > -- !query 64 output > NULL > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28859) Remove value check of MEMORY_OFFHEAP_SIZE in declaration section
[ https://issues.apache.org/jira/browse/SPARK-28859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28859. --- Assignee: (was: yifan) Resolution: Invalid According to the test failures on the PR, I'm closing this issue as `Invalid`. Since `MemoryManager` validates this already, it's enough. > Remove value check of MEMORY_OFFHEAP_SIZE in declaration section > > > Key: SPARK-28859 > URL: https://issues.apache.org/jira/browse/SPARK-28859 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > > Now MEMORY_OFFHEAP_SIZE has a default value of 0, but it should be greater than 0 > when > MEMORY_OFFHEAP_ENABLED is true; should we check this condition in code? > > SPARK-28577 added this check before requesting memory resources from Yarn > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
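The invariant under discussion fits in a few lines. This is an illustrative sketch of the check that `MemoryManager` is said to perform already, not Spark's actual code; the point of the resolution is that it belongs in one place rather than also at config declaration time:

```python
def validate_offheap(enabled: bool, size_bytes: int) -> None:
    # A positive off-heap size is only required when off-heap memory
    # is actually enabled; the default of 0 is legal when it is disabled.
    if enabled and size_bytes <= 0:
        raise ValueError(
            "spark.memory.offHeap.size must be > 0 when "
            "spark.memory.offHeap.enabled is true")

validate_offheap(False, 0)       # fine: default size 0, off-heap disabled
validate_offheap(True, 1 << 30)  # fine: 1 GiB off-heap
```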
[jira] [Commented] (SPARK-29892) Add built-in Array Functions: array_cat
[ https://issues.apache.org/jira/browse/SPARK-29892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974800#comment-16974800 ] Aman Omer commented on SPARK-29892: --- This Jira is a duplicate of https://issues.apache.org/jira/browse/SPARK-29737 . > Add built-in Array Functions: array_cat > --- > > Key: SPARK-29892 > URL: https://issues.apache.org/jira/browse/SPARK-29892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_cat(anyarray, anyarray)}}|{{anyarray}}|concatenate two > arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| > Other DBs: > [https://phoenix.apache.org/language/functions.html#array_cat] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
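A sketch of the requested semantics in Python, using lists as a stand-in for SQL arrays. The NULL handling shown is an assumption modeled on PostgreSQL's behavior (a NULL input yields the other array), not a confirmed Spark design:

```python
def array_cat(a, b):
    # Assumed PostgreSQL-style NULL handling: NULL concatenated with an
    # array returns the other array unchanged.
    if a is None:
        return b
    if b is None:
        return a
    return a + b

print(array_cat([1, 2, 3], [4, 5]))  # [1, 2, 3, 4, 5]
```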
[jira] [Updated] (SPARK-27884) Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27884: - Description: Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark 3.0. dev list: http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCEMENT-Plan-for-dropping-Python-2-support-td27335.html http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Deprecate-Python-lt-3-6-in-Spark-3-0-td28168.html was:Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark 3.0. > Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0 > - > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark > 3.0. > dev list: > http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCEMENT-Plan-for-dropping-Python-2-support-td27335.html > http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Deprecate-Python-lt-3-6-in-Spark-3-0-td28168.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27884) Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27884: - Description: Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark 3.0. (was: Officially deprecate Python 2 support in Spark 3.0.) > Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0 > - > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Officially deprecate Python 2 support and Python 3 prior to 3.6 in Spark > 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27884) Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27884. -- Fix Version/s: 3.0.0 Resolution: Done > Deprecate Python 2 and Python 3 prior to 3.6 support in Spark 3.0 > - > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974791#comment-16974791 ] Hyukjin Kwon commented on SPARK-29803: -- (I converted this into a subtask of a new JIRA) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > There are 135 Python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
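The cleanup itself is mechanical. A hedged sketch of the per-file transformation (the helper is illustrative; the actual PR may work differently):

```python
import re

# Matches the whole import line, including its newline, in multiline mode.
FUTURE_RE = re.compile(r"^from __future__ import print_function\n", re.MULTILINE)

def strip_print_function(source: str) -> str:
    # Under Python 3, print() is always a function, so the import is dead code.
    return FUTURE_RE.sub("", source)

before = "from __future__ import print_function\nprint('hello')\n"
print(strip_print_function(before))  # print('hello')
```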
[jira] [Updated] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29803: - Parent: SPARK-29909 Issue Type: Sub-task (was: Bug) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29897) Implicit cast to timestamp is failing
[ https://issues.apache.org/jira/browse/SPARK-29897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974789#comment-16974789 ] Ankit Raj Boudh commented on SPARK-29897: - I will raise a PR for this > Implicit cast to timestamp is failing > -- > > Key: SPARK-29897 > URL: https://issues.apache.org/jira/browse/SPARK-29897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Spark cannot cast implicitly > jdbc:hive2://10.18.19.208:23040/default> SELECT EXTRACT(DAY FROM NOW() - > '2014-08-02 08:10:56'); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > '(current_timestamp() - CAST('2014-08-02 08:10:56' AS DOUBLE))' due to data > type mismatch: differing types in '(current_timestamp() - CAST('2014-08-02 > 08:10:56' AS DOUBLE))' (timestamp and double).; line 1 pos 24; > PostgreSQL and MySQL can handle the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
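What the failing query expects can be sketched with Python's datetime as a stand-in for Spark's timestamp arithmetic: the string literal should be implicitly cast to a timestamp (not a double) before the subtraction, and EXTRACT(DAY FROM ...) then reads the day component of the resulting interval. The function name is illustrative:

```python
from datetime import datetime

def extract_day_from_diff(now: datetime, literal: str) -> int:
    # Implicitly cast the string literal to a timestamp, not a double.
    ts = datetime.strptime(literal, "%Y-%m-%d %H:%M:%S")
    # EXTRACT(DAY FROM <interval>) reads the interval's day component.
    return (now - ts).days

print(extract_day_from_diff(datetime(2019, 11, 15), "2014-08-02 08:10:56"))
```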
[jira] [Updated] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29803: - Parent: (was: SPARK-27884) Issue Type: Bug (was: Sub-task) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Bug > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29802) update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974788#comment-16974788 ] Hyukjin Kwon commented on SPARK-29802: -- (I converted this into a subtask of a new JIRA) > update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > There are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
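The shebang rewrite for the files listed above is likewise mechanical; a hedged sketch of the per-file edit (helper name illustrative):

```python
def update_shebang(text: str) -> str:
    # Only rewrite a first line that is exactly the generic python shebang;
    # leave shell scripts and already-updated files alone.
    lines = text.splitlines(keepends=True)
    if lines and lines[0].rstrip() == "#!/usr/bin/env python":
        lines[0] = "#!/usr/bin/env python3\n"
    return "".join(lines)

print(update_shebang("#!/usr/bin/env python\nimport sys\n"))
```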
[jira] [Updated] (SPARK-29909) Drop Python 2 and Python 3.4 and 3.5.
[ https://issues.apache.org/jira/browse/SPARK-29909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29909: - Issue Type: Umbrella (was: Bug) > Drop Python 2 and Python 3.4 and 3.5. > - > > Key: SPARK-29909 > URL: https://issues.apache.org/jira/browse/SPARK-29909 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We deprecated Python 2 and Python 3 prior to 3.6 in PySpark at SPARK-27884. We should drop them in Spark 3.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29802) update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29802: - Parent: (was: SPARK-27884) Issue Type: Bug (was: Sub-task) > update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > there are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29802) update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29802: - Parent: SPARK-29909 Issue Type: Sub-task (was: Bug) > update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > there are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29909) Drop Python 2 and Python 3.4 and 3.5.
Hyukjin Kwon created SPARK-29909: Summary: Drop Python 2 and Python 3.4 and 3.5. Key: SPARK-29909 URL: https://issues.apache.org/jira/browse/SPARK-29909 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon We deprecated Python 2 and Python 3 prior to 3.6 in PySpark at SPARK-27884. We should drop them in Spark 3.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28752) Documentation build script to support Python 3
[ https://issues.apache.org/jira/browse/SPARK-28752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28752. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26521 [https://github.com/apache/spark/pull/26521] > Documentation build script to support Python 3 > -- > > Key: SPARK-28752 > URL: https://issues.apache.org/jira/browse/SPARK-28752 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > Seems documentation build: > https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html > doesn't support Python 3. We should support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28752) Documentation build script to support Python 3
[ https://issues.apache.org/jira/browse/SPARK-28752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28752: Assignee: Hyukjin Kwon > Documentation build script to support Python 3 > -- > > Key: SPARK-28752 > URL: https://issues.apache.org/jira/browse/SPARK-28752 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Seems documentation build: > https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html > doesn't support Python 3. We should support it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29376) Upgrade Apache Arrow to 0.15.1
[ https://issues.apache.org/jira/browse/SPARK-29376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29376: - Fix Version/s: 3.0.0 > Upgrade Apache Arrow to 0.15.1 > -- > > Key: SPARK-29376 > URL: https://issues.apache.org/jira/browse/SPARK-29376 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > Apache Arrow 0.15.0 was just released see > [https://arrow.apache.org/blog/2019/10/06/0.15.0-release/] > There are a number of fixes and improvements including a change to the binary > IPC format https://issues.apache.org/jira/browse/ARROW-6313. > The next planned release will be 1.0.0, so it would be good to upgrade Spark > as a preliminary step. > Updated to use Apache Arrow 0.15.1, which was released soon after 0.15.0 to > include some important bug fixes. > change log at https://arrow.apache.org/release/0.15.1.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29376) Upgrade Apache Arrow to 0.15.1
[ https://issues.apache.org/jira/browse/SPARK-29376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29376. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/26133 > Upgrade Apache Arrow to 0.15.1 > -- > > Key: SPARK-29376 > URL: https://issues.apache.org/jira/browse/SPARK-29376 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Apache Arrow 0.15.0 was just released see > [https://arrow.apache.org/blog/2019/10/06/0.15.0-release/] > There are a number of fixes and improvements including a change to the binary > IPC format https://issues.apache.org/jira/browse/ARROW-6313. > The next planned release will be 1.0.0, so it would be good to upgrade Spark > as a preliminary step. > Updated to use Apache Arrow 0.15.1, which was released soon after 0.15.0 to > include some important bug fixes. > change log at https://arrow.apache.org/release/0.15.1.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29908) Add a Python, Pandas and PyArrow versions in clue at SQL query tests
Hyukjin Kwon created SPARK-29908: Summary: Add a Python, Pandas and PyArrow versions in clue at SQL query tests Key: SPARK-29908 URL: https://issues.apache.org/jira/browse/SPARK-29908 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Once a Python test case fails in the integrated UDF test cases, it's difficult to find out the version information. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113828/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql___Scalar_Pandas_UDF/ as an example. It might be better to add the version information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
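One way to gather such a clue defensively, sketched in Python (the function name is illustrative; pandas and PyArrow may not be installed, so missing packages are reported rather than raised):

```python
import sys

def version_clue() -> str:
    # Collect Python, pandas and pyarrow versions for a test-failure message.
    parts = ["Python %d.%d.%d" % sys.version_info[:3]]
    for pkg in ("pandas", "pyarrow"):
        try:
            mod = __import__(pkg)
            parts.append("%s %s" % (pkg, mod.__version__))
        except ImportError:
            parts.append("%s not installed" % pkg)
    return ", ".join(parts)

print(version_clue())
```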
[jira] [Updated] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-29906: -- Description: we observed an issue where spark seems to confuse a data line (not the first line of the csv file) for the csv header when it creates the schema. {code} $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP $ unzip PGYR13_P062819.ZIP $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf spark.sql.adaptive.enabled=true --num-executors 10 Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 Spark context available as 'sc' (master = yarn, app id = application_1573772077642_0006). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) Type in expressions to have them evaluated. Type :help for more information. scala> spark.read.format("csv").option("header", true).option("enforceSchema", false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. [Stage 2:>(0 + 10) / 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): java.lang.IllegalArgumentException: CSV header does not conform to the schema. 
Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, Recipient_Primary_Business_Street_Address_Line2, Recipient_City, Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, Physician_License_State_code1, Physician_License_State_code2, Physician_License_State_code3, Physician_License_State_code4, Physician_License_State_code5, Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, Total_Amount_of_Payment_USDollars, Date_of_Payment, Number_of_Payments_Included_in_Total_Amount, Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, City_of_Travel, State_of_Travel, Country_of_Travel, Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, Contextual_Information, Delay_in_Publication_Indicator, Record_ID, Dispute_Status_for_Publication, Product_Indicator, Name_of_Associated_Covered_Drug_or_Biological1, Name_of_Associated_Covered_Drug_or_Biological2, Name_of_Associated_Covered_Drug_or_Biological3, Name_of_Associated_Covered_Drug_or_Biological4, Name_of_Associated_Covered_Drug_or_Biological5, NDC_of_Associated_Covered_Drug_or_Biological1, NDC_of_Associated_Covered_Drug_or_Biological2, NDC_of_Associated_Covered_Drug_or_Biological3, NDC_of_Associated_Covered_Drug_or_Biological4, NDC_of_Associated_Covered_Drug_or_Biological5, 
Name_of_Associated_Covered_Device_or_Medical_Supply1, Name_of_Associated_Covered_Device_or_Medical_Supply2, Name_of_Associated_Covered_Device_or_Medical_Supply3, Name_of_Associated_Covered_Device_or_Medical_Supply4, Name_of_Associated_Covered_Device_or_Medical_Supply5, Program_Year, Payment_Publication_Date Schema: UNCHANGED, Covered Recipient Physician, _c2, _c3, _c4, 278352, JOHN, M, RAY, JR, 3625 CAPE CENTER DR, _c11, FAYETTEVILLE, NC13, 28304-4457, United States15, _c16, _c17, Medical Doctor, Allopathic & Osteopathic Physicians|Family Medicine, NC20, _c21, _c22, _c23, _c24, Par Pharmaceutical, Inc.25, 10010989, Par Pharmaceutical, Inc.27, NY, United States29, 17.29, 10/23/2013, 1, In-kind items and services, Food and Beverage, _c35, _c36, _c37, No38, No Third Party Payment, _c40, _c41, _c42, _c43, No44, 104522962, No46, Covered, MEGACE ES MEGESTROL ACETATE, _c49, _c50, _c51, _c52, 4988409496, _c54, _c55, _c56, _c57,
[jira] [Created] (SPARK-29907) Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte.
Xianyin Xin created SPARK-29907: --- Summary: Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte. Key: SPARK-29907 URL: https://issues.apache.org/jira/browse/SPARK-29907 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin SPARK-27444 introduced `dmlStatementNoWith` so that any DML that needs CTE support can leverage it. It would be better if we moved the DELETE/UPDATE/MERGE rules to `dmlStatementNoWith`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
[ https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-29906: -- Description: we observed an issue where spark seems to confuse a data line (not the first line of the csv file) for the csv header. {code} $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP $ unzip PGYR13_P062819.ZIP $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf spark.sql.adaptive.enabled=true --num-executors 10 Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 Spark context available as 'sc' (master = yarn, app id = application_1573772077642_0006). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) Type in expressions to have them evaluated. Type :help for more information. scala> spark.read.format("csv").option("header", true).option("enforceSchema", false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. [Stage 2:>(0 + 10) / 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): java.lang.IllegalArgumentException: CSV header does not conform to the schema. 
Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, Recipient_Primary_Business_Street_Address_Line2, Recipient_City, Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, Physician_License_State_code1, Physician_License_State_code2, Physician_License_State_code3, Physician_License_State_code4, Physician_License_State_code5, Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, Total_Amount_of_Payment_USDollars, Date_of_Payment, Number_of_Payments_Included_in_Total_Amount, Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, City_of_Travel, State_of_Travel, Country_of_Travel, Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, Contextual_Information, Delay_in_Publication_Indicator, Record_ID, Dispute_Status_for_Publication, Product_Indicator, Name_of_Associated_Covered_Drug_or_Biological1, Name_of_Associated_Covered_Drug_or_Biological2, Name_of_Associated_Covered_Drug_or_Biological3, Name_of_Associated_Covered_Drug_or_Biological4, Name_of_Associated_Covered_Drug_or_Biological5, NDC_of_Associated_Covered_Drug_or_Biological1, NDC_of_Associated_Covered_Drug_or_Biological2, NDC_of_Associated_Covered_Drug_or_Biological3, NDC_of_Associated_Covered_Drug_or_Biological4, NDC_of_Associated_Covered_Drug_or_Biological5, 
Name_of_Associated_Covered_Device_or_Medical_Supply1, Name_of_Associated_Covered_Device_or_Medical_Supply2, Name_of_Associated_Covered_Device_or_Medical_Supply3, Name_of_Associated_Covered_Device_or_Medical_Supply4, Name_of_Associated_Covered_Device_or_Medical_Supply5, Program_Year, Payment_Publication_Date Schema: UNCHANGED, Covered Recipient Physician, _c2, _c3, _c4, 278352, JOHN, M, RAY, JR, 3625 CAPE CENTER DR, _c11, FAYETTEVILLE, NC13, 28304-4457, United States15, _c16, _c17, Medical Doctor, Allopathic & Osteopathic Physicians|Family Medicine, NC20, _c21, _c22, _c23, _c24, Par Pharmaceutical, Inc.25, 10010989, Par Pharmaceutical, Inc.27, NY, United States29, 17.29, 10/23/2013, 1, In-kind items and services, Food and Beverage, _c35, _c36, _c37, No38, No Third Party Payment, _c40, _c41, _c42, _c43, No44, 104522962, No46, Covered, MEGACE ES MEGESTROL ACETATE, _c49, _c50, _c51, _c52, 4988409496, _c54, _c55, _c56, _c57, _c58, _c59, _c60, _c61,
[jira] [Created] (SPARK-29906) Reading of csv file fails with adaptive execution turned on
koert kuipers created SPARK-29906: - Summary: Reading of csv file fails with adaptive execution turned on Key: SPARK-29906 URL: https://issues.apache.org/jira/browse/SPARK-29906 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: build from master today nov 14 commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, upstream/master, upstream/HEAD) Author: Kevin Yu Date: Thu Nov 14 14:58:32 2019 -0600 build using: $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn deployed on AWS EMR 5.28 with 10 m5.xlarge slaves in spark-env.sh: HADOOP_CONF_DIR=/etc/hadoop/conf in spark-defaults.conf: spark.master yarn spark.submit.deployMode client spark.serializer org.apache.spark.serializer.KryoSerializer spark.hadoop.yarn.timeline-service.enabled false spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native Reporter: koert kuipers we observed an issue where spark seems to confuse a data line (not the first line of the csv file) for the csv header. {code} $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP $ unzip PGYR13_P062819.ZIP $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf spark.sql.adaptive.enabled=true --num-executors 10 Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040 Spark context available as 'sc' (master = yarn, app id = application_1573772077642_0006). 
Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222) Type in expressions to have them evaluated. Type :help for more information. scala> spark.read.format("csv").option("header", true).option("enforceSchema", false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1) 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. [Stage 2:>(0 + 10) / 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, Recipient_Primary_Business_Street_Address_Line2, Recipient_City, Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, Physician_License_State_code1, Physician_License_State_code2, Physician_License_State_code3, Physician_License_State_code4, Physician_License_State_code5, Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, Total_Amount_of_Payment_USDollars, Date_of_Payment, Number_of_Payments_Included_in_Total_Amount, Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, City_of_Travel, 
State_of_Travel, Country_of_Travel, Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, Contextual_Information, Delay_in_Publication_Indicator, Record_ID, Dispute_Status_for_Publication, Product_Indicator, Name_of_Associated_Covered_Drug_or_Biological1, Name_of_Associated_Covered_Drug_or_Biological2, Name_of_Associated_Covered_Drug_or_Biological3, Name_of_Associated_Covered_Drug_or_Biological4, Name_of_Associated_Covered_Drug_or_Biological5, NDC_of_Associated_Covered_Drug_or_Biological1, NDC_of_Associated_Covered_Drug_or_Biological2,
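The failure above can be reduced to a small sketch. This is not Spark's actual CSVHeaderChecker, just a hedged illustration of the check that fires: the tokens of the line believed to be the header are compared, position by position, with the schema's column names. The symptom reported here is that with adaptive execution on, a data line from the middle of the file ends up being validated as if it were the header.

```python
# Hedged sketch (not Spark's actual CSVHeaderChecker) of the check that
# produced the error above: compare the supposed header tokens against the
# schema's column names and raise on any mismatch.

def check_header(schema_fields, header_tokens):
    if list(schema_fields) != list(header_tokens):
        raise ValueError(
            "CSV header does not conform to the schema. Header: "
            + ", ".join(header_tokens)
        )

schema = ["Change_Type", "Covered_Recipient_Type"]

check_header(schema, ["Change_Type", "Covered_Recipient_Type"])  # real header: passes

try:
    # A data line mistaken for the header fails the check, as in the log above.
    check_header(schema, ["UNCHANGED", "Covered Recipient Physician"])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```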
[jira] [Commented] (SPARK-26499) JdbcUtils.makeGetter does not handle ByteType
[ https://issues.apache.org/jira/browse/SPARK-26499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974757#comment-16974757 ] Dongjoon Hyun commented on SPARK-26499: --- This is backported to branch-2.4 via https://github.com/apache/spark/pull/26531 > JdbcUtils.makeGetter does not handle ByteType > - > > Key: SPARK-26499 > URL: https://issues.apache.org/jira/browse/SPARK-26499 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas D'Silva >Assignee: Thomas D'Silva >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I am trying to use the DataSource V2 API to read from a JDBC source. While > using {{JdbcUtils.resultSetToSparkInternalRows}} to create an internal row > from a ResultSet that has a column of type TINYINT I ran into the following > exception > {code:java} > java.lang.IllegalArgumentException: Unsupported type tinyint > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter(JdbcUtils.scala:502) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters(JdbcUtils.scala:379) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.<init>(JdbcUtils.scala:340) > {code} > This happens because ByteType is not handled in {{JdbcUtils.makeGetter}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
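A hedged sketch (in Python, with an illustrative dispatch table rather than Spark's Scala pattern match) of why the exception is thrown and what the fix looks like: makeGetter maps each column's data type to a reader function, and a type missing from that mapping surfaces as "Unsupported type tinyint"; the fix is to register a getter for ByteType.

```python
# Illustrative dispatch-table version of JdbcUtils.makeGetter: look up the
# reader function for a Catalyst data type, failing for unregistered types.

def make_getter(data_type, getters):
    try:
        return getters[data_type]
    except KeyError:
        raise ValueError(f"Unsupported type {data_type}")

getters = {
    "int": lambda row, i: int(row[i]),
    "long": lambda row, i: int(row[i]),
    # Before the fix there is no entry for "tinyint" (ByteType).
}

try:
    make_getter("tinyint", getters)
    unsupported = False
except ValueError:
    unsupported = True
print(unsupported)  # True

# After the fix: register a ByteType getter and the lookup succeeds.
getters["tinyint"] = lambda row, i: int(row[i]) & 0xFF
print(make_getter("tinyint", getters)(["7"], 0))  # 7
```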
[jira] [Updated] (SPARK-26499) JdbcUtils.makeGetter does not handle ByteType
[ https://issues.apache.org/jira/browse/SPARK-26499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26499: -- Fix Version/s: 2.4.5 > JdbcUtils.makeGetter does not handle ByteType > - > > Key: SPARK-26499 > URL: https://issues.apache.org/jira/browse/SPARK-26499 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas D'Silva >Assignee: Thomas D'Silva >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I am trying to use the DataSource V2 API to read from a JDBC source. While > using {{JdbcUtils.resultSetToSparkInternalRows}} to create an internal row > from a ResultSet that has a column of type TINYINT I ran into the following > exception > {code:java} > java.lang.IllegalArgumentException: Unsupported type tinyint > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter(JdbcUtils.scala:502) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters(JdbcUtils.scala:379) > at > 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.<init>(JdbcUtils.scala:340) > {code} > This happens because ByteType is not handled in {{JdbcUtils.makeGetter}}.
[jira] [Resolved] (SPARK-28602) Recognize interval as a numeric type
[ https://issues.apache.org/jira/browse/SPARK-28602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-28602. -- Fix Version/s: 3.0.0 Target Version/s: 3.0.0 Resolution: Duplicate > Recognize interval as a numeric type > > > Key: SPARK-28602 > URL: https://issues.apache.org/jira/browse/SPARK-28602 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > Fix For: 3.0.0 > > > Hello, > Spark does not recognize `interval` type as a `numeric` one, which means that > we can't use `interval` columns in aggregated functions. For instance, the > following query works on PgSQL but does not work on Spark: > {code:sql}SELECT i,AVG(cast(v as interval)) OVER (ORDER BY i ROWS BETWEEN > CURRENT ROW AND UNBOUNDED FOLLOWING) FROM (VALUES(1,'1 sec'),(2,'2 > sec'),(3,NULL),(4,NULL)) t(i,v);{code} > {code:sql}cannot resolve 'avg(CAST(`v` AS INTERVAL))' due to data type > mismatch: function average requires numeric types, not interval; line 1 pos > 9{code}
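The averaging the reporter asks for is well defined as soon as a type supports addition and division by a count, which is all AVG needs; Python's timedelta illustrates this outside of Spark (illustrative only, not Spark's interval implementation):

```python
# Averaging interval-like values with NULLs skipped, mirroring the
# PgSQL query in the ticket: AVG over ('1 sec', '2 sec', NULL, NULL).
from datetime import timedelta

values = [timedelta(seconds=1), timedelta(seconds=2), None, None]
present = [v for v in values if v is not None]   # AVG ignores NULLs
avg = sum(present, timedelta(0)) / len(present)  # add, then divide by count
print(avg)  # 0:00:01.500000
```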
[jira] [Resolved] (SPARK-29889) unify the interval tests
[ https://issues.apache.org/jira/browse/SPARK-29889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29889. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26515 [https://github.com/apache/spark/pull/26515] > unify the interval tests > > > Key: SPARK-29889 > URL: https://issues.apache.org/jira/browse/SPARK-29889 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-29901) Fix broken links in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-29901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29901. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26528 [https://github.com/apache/spark/pull/26528] > Fix broken links in SQL Reference > - > > Key: SPARK-29901 > URL: https://issues.apache.org/jira/browse/SPARK-29901 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Fix the broken links
[jira] [Created] (SPARK-29905) ExecutorPodsLifecycleManager has sub-optimal behavior with dynamic allocation
Marcelo Masiero Vanzin created SPARK-29905: -- Summary: ExecutorPodsLifecycleManager has sub-optimal behavior with dynamic allocation Key: SPARK-29905 URL: https://issues.apache.org/jira/browse/SPARK-29905 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.0 Reporter: Marcelo Masiero Vanzin I've been playing with dynamic allocation on k8s and noticed some weird behavior from ExecutorPodsLifecycleManager when it's on. The cause of this behavior is mostly because of the higher rate of pod updates when you have dynamic allocation. Pods being created and going away all the time generate lots of events, that are then translated into "snapshots" internally in Spark, and fed to subscribers such as ExecutorPodsLifecycleManager. The first effect of that is that you get a lot of spurious logging. Since snapshots are incremental, you can get lots of snapshots with the same "PodDeleted" information, for example, and ExecutorPodsLifecycleManager will log for all of them. Yes, log messages are at debug level, but if you're debugging that stuff, it's really noisy and distracting. The second effect is that the same way you get multiple log messages, you end up calling into the Spark scheduler, and worse, into the K8S API server, multiple times for the same pod update. We can optimize that and reduce the chattiness with the API server.
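The proposed optimization can be sketched as simple per-pod deduplication (names here are illustrative, not Spark's actual classes): remember the last state acted on for each pod and skip snapshots that carry no new information, so the scheduler and the K8S API server are contacted once per real transition.

```python
# Hedged sketch of deduplicating incremental pod snapshots: only act (log,
# call the scheduler, call the API server) when a pod's state actually changes.

def process_snapshots(snapshots):
    last_seen = {}   # pod name -> last state acted upon
    actions = []     # calls we would make downstream
    for snapshot in snapshots:
        for pod, state in snapshot.items():
            if last_seen.get(pod) != state:
                last_seen[pod] = state
                actions.append((pod, state))
    return actions

# Incremental snapshots repeat the same "deleted" event for exec-1.
snapshots = [
    {"exec-1": "running"},
    {"exec-1": "deleted"},
    {"exec-1": "deleted"},   # spurious repeat: no action, no log line
]
print(process_snapshots(snapshots))  # [('exec-1', 'running'), ('exec-1', 'deleted')]
```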
[jira] [Comment Edited] (SPARK-25185) CBO rowcount statistics doesn't work for partitioned parquet external table
[ https://issues.apache.org/jira/browse/SPARK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974724#comment-16974724 ] venkata yerubandi edited comment on SPARK-25185 at 11/15/19 1:21 AM: - Is there any update on this issue? We are facing the same issue in Spark 2.4.0 was (Author: raoyvn): Is there any update on this issue ? we are facing the same issue > CBO rowcount statistics doesn't work for partitioned parquet external table > --- > > Key: SPARK-25185 > URL: https://issues.apache.org/jira/browse/SPARK-25185 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.2.1, 2.3.0 > Environment: > Tried on Ubuntu, FreBSD and windows, running spark-shell in local mode > reading data from local file system >Reporter: Amit >Priority: Major > > Created a dummy partitioned data with partition column on string type col1=a > and col1=b > added csv data-> read through spark -> created partitioned external table-> > msck repair table to load partition. Did analyze on all columns and partition > column as well. > ~println(spark.sql("select * from test_p where > e='1a'").queryExecution.toStringWithStats)~ > ~val op = spark.sql("select * from test_p where > e='1a'").queryExecution.optimizedPlan~ > // e is the partitioned column > ~val stat = op.stats(spark.sessionState.conf)~ > ~print(stat.rowCount)~ > > Created the same way in parquet the rowcount comes up correctly in case of > csv but in parquet it shows as None.
[jira] [Commented] (SPARK-25185) CBO rowcount statistics doesn't work for partitioned parquet external table
[ https://issues.apache.org/jira/browse/SPARK-25185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974724#comment-16974724 ] venkata yerubandi commented on SPARK-25185: --- Is there any update on this issue? We are facing the same issue > CBO rowcount statistics doesn't work for partitioned parquet external table > --- > > Key: SPARK-25185 > URL: https://issues.apache.org/jira/browse/SPARK-25185 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.2.1, 2.3.0 > Environment: > Tried on Ubuntu, FreBSD and windows, running spark-shell in local mode > reading data from local file system >Reporter: Amit >Priority: Major > > Created a dummy partitioned data with partition column on string type col1=a > and col1=b > added csv data-> read through spark -> created partitioned external table-> > msck repair table to load partition. Did analyze on all columns and partition > column as well. > ~println(spark.sql("select * from test_p where > e='1a'").queryExecution.toStringWithStats)~ > ~val op = spark.sql("select * from test_p where > e='1a'").queryExecution.optimizedPlan~ > // e is the partitioned column > ~val stat = op.stats(spark.sessionState.conf)~ > ~print(stat.rowCount)~ > > Created the same way in parquet the rowcount comes up correctly in case of > csv but in parquet it shows as None.
[jira] [Resolved] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
[ https://issues.apache.org/jira/browse/SPARK-29857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29857. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26482 [https://github.com/apache/spark/pull/26482] > [WEB UI] Support defer render the spark history summary page. > -- > > Key: SPARK-29857 > URL: https://issues.apache.org/jira/browse/SPARK-29857 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > Fix For: 3.0.0 > > > When there are many applications in the spark history server, rendering of the > history summary page is heavy; we can enable deferRender to tune it. > See details https://datatables.net/examples/ajax/defer_render.html
[jira] [Assigned] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
[ https://issues.apache.org/jira/browse/SPARK-29857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29857: Assignee: feiwang > [WEB UI] Support defer render the spark history summary page. > -- > > Key: SPARK-29857 > URL: https://issues.apache.org/jira/browse/SPARK-29857 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: feiwang >Priority: Minor > > When there are many applications in the spark history server, rendering of the > history summary page is heavy; we can enable deferRender to tune it. > See details https://datatables.net/examples/ajax/defer_render.html
[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974678#comment-16974678 ] Bryan Cutler commented on SPARK-29748: -- Thanks for discussing [~zero323] . The goal here is to only remove the sorting of fields, which causes all kinds of weird inconsistencies like in your above example. I'd prefer to leave efficient field access for another time. Since Row is a subclass of tuple, accessing fields by name has never been efficient and I don't want to change the fundamental design here. The only reason to introduce LegacyRow (which will be deprecated) is to maintain backward compatibility with existing code that expects fields to be sorted. > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the Fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. 
Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc). If kwargs are used, an error will be raised, or, > based on a conf setting, it can fall back to a LegacyRow that will sort the > fields as before. This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped.
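The sorting inconsistency is easy to demonstrate in plain Python, without Spark (a hedged sketch; PySpark's real Row is a tuple subclass with more machinery):

```python
# Legacy behavior: kwargs sorted alphabetically, so field order no longer
# matches what the user wrote and positional schema application misfires.
def legacy_row(**kwargs):
    return tuple(kwargs[k] for k in sorted(kwargs))

# Proposed behavior: keep fields in the order entered, which Python 3.6+
# guarantees for keyword arguments.
def new_row(**kwargs):
    return tuple(kwargs.values())

# User writes name first, age second; schema is (name: string, age: int).
print(legacy_row(name="Alice", age=1))  # (1, 'Alice') -- surprising order
print(new_row(name="Alice", age=1))     # ('Alice', 1) -- as entered
```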
[jira] [Assigned] (SPARK-29865) k8s executor pods all have different prefixes in client mode
[ https://issues.apache.org/jira/browse/SPARK-29865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson reassigned SPARK-29865: -- Assignee: Marcelo Masiero Vanzin > k8s executor pods all have different prefixes in client mode > > > Key: SPARK-29865 > URL: https://issues.apache.org/jira/browse/SPARK-29865 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Minor > Fix For: 3.0.0 > > > This works in cluster mode since the features set things up so that all > executor pods have the same name prefix. > But in client mode features are not used; so each executor ends up with a > different name prefix, which makes debugging a little bit annoying.
[jira] [Resolved] (SPARK-29865) k8s executor pods all have different prefixes in client mode
[ https://issues.apache.org/jira/browse/SPARK-29865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson resolved SPARK-29865. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26488 [https://github.com/apache/spark/pull/26488] > k8s executor pods all have different prefixes in client mode > > > Key: SPARK-29865 > URL: https://issues.apache.org/jira/browse/SPARK-29865 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Priority: Minor > Fix For: 3.0.0 > > > This works in cluster mode since the features set things up so that all > executor pods have the same name prefix. > But in client mode features are not used; so each executor ends up with a > different name prefix, which makes debugging a little bit annoying.
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974655#comment-16974655 ] Terry Kim commented on SPARK-29900: --- If we make the relation lookup behavior consistent such that 1) temp views are resolved first 2) then tables are resolved, [~brkyvz], for your example, {code} // Create temporary view 't' spark.sql("create temporary view t as select 2 as i"); // BREAKING CHANGE: currently, the following is allowed. // But with the new resolution behavior, this should not be allowed (same as the postgresql behavior) spark.range(0, 5).write.saveAsTable("t") // you should be able to qualify the table name to make it work. spark.range(0, 5).write.saveAsTable("default.t") {code} For the DROP behavior: {code} spark.sql("show tables").show ++-+---+ |database|tableName|isTemporary| ++-+---+ | default|t| false| ||t| true| ++-+---+ // BREAKING CHANGE: currently, the following is allowed and drops the view. // But it should say '"t" is not a table'. spark.sql("drop table t") {code} [~rdblue], yes, this will be a breaking change. [~cloud_fan] is this in line with what you were thinking? > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. 
It's only useful when there are a temp view and a table > with the same name, but users can easily use a qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code}
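The proposed consistent rule can be sketched in a few lines (illustrative only, not Spark's analyzer code): an unqualified name resolves against temp views first, then tables; a qualified name such as "default.t" bypasses temp views entirely, which is how users disambiguate.

```python
# Hedged sketch of the consistent relation-lookup rule discussed above.
def resolve(name, temp_views, tables):
    if "." in name:                      # qualified name: db.table
        db, table = name.split(".", 1)
        return tables.get((db, table))   # never resolves to a temp view
    if name in temp_views:               # unqualified: temp view wins
        return temp_views[name]
    return tables.get(("default", name))

temp_views = {"t": "temp view t"}
tables = {("default", "t"): "table default.t"}

print(resolve("t", temp_views, tables))          # temp view t
print(resolve("default.t", temp_views, tables))  # table default.t
```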
[jira] [Comment Edited] (SPARK-29884) spark-submit to kuberentes can not parse valid ca certificate
[ https://issues.apache.org/jira/browse/SPARK-29884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974621#comment-16974621 ] Jeremy edited comment on SPARK-29884 at 11/14/19 9:34 PM: -- After doing some debugging it seems like this might be in the fabric8 k8s client. It tries to use .kube/config even if it gets all the parameters it needs from arguments. was (Author: jeremyjjbrown): After doing some debugging it seams like this might be in fabric k8s client. I tries to use .kube/config even if it gets all the parameters is needs from arguments. > spark-submit to kuberentes can not parse valid ca certificate > - > > Key: SPARK-29884 > URL: https://issues.apache.org/jira/browse/SPARK-29884 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.4 > Environment: A kubernetes cluster that has been in use for over 2 > years and handles large amounts of production payloads. >Reporter: Jeremy >Priority: Major > > spark-submit can not be used to schedule to kubernetes with an oauth token > and cacert > {code:java} > spark-submit \ > --deploy-mode cluster \ > --class org.apache.spark.examples.SparkPi \ > --master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \ > --conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf > spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt > \ > --conf spark.kubernetes.namespace=here-olp-3dds-sit \ > --conf spark.executor.instances=1 \ > --conf spark.app.name=spark-pi \ > --conf > spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 > \ > --conf > spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 > \ > local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar > {code} > returns > {code:java} > log4j:WARN No appenders could be found for logger >
(io.fabric8.kubernetes.client.Config). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) > at > io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183) > at > org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.security.cert.CertificateException: Could not parse > certificate: java.io.IOException: Empty input > at > 
sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110) > at > java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339) > at > io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104) > at > io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122) > at >
[jira] [Commented] (SPARK-29884) spark-submit to kubernetes cannot parse valid ca certificate
[ https://issues.apache.org/jira/browse/SPARK-29884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974621#comment-16974621 ] Jeremy commented on SPARK-29884: After doing some debugging it seems like this might be in the fabric8 k8s client. It tries to use .kube/config even if it gets all the parameters it needs from arguments. > spark-submit to kubernetes cannot parse valid ca certificate > - > > Key: SPARK-29884 > URL: https://issues.apache.org/jira/browse/SPARK-29884 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.4 > Environment: A kubernetes cluster that has been in use for over 2 > years and handles large amounts of production payloads. >Reporter: Jeremy >Priority: Major > > spark-submit cannot be used to schedule to kubernetes with an oauth token > and cacert > {code:java} > spark-submit \ > --deploy-mode cluster \ > --class org.apache.spark.examples.SparkPi \ > --master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \ > --conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf > spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt > \ > --conf spark.kubernetes.namespace=here-olp-3dds-sit \ > --conf spark.executor.instances=1 \ > --conf spark.app.name=spark-pi \ > --conf > spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 > \ > --conf > spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 > \ > local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar > {code} > returns > {code:java} > log4j:WARN No appenders could be found for logger > (io.fabric8.kubernetes.client.Config). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. 
> Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) > at > io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183) > at > org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.security.cert.CertificateException: Could not parse > certificate: java.io.IOException: Empty input > at > sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110) > at > java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339) > at > 
io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104) > at > io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122) > at > io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:78) > ... 13 more > Caused by: java.io.IOException: Empty input > at > sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106) > ... 19 more > {code} > The cacert and
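The root cause in the trace above is `java.io.IOException: Empty input` thrown while the fabric8 client builds its keystore, which is what the JDK raises when the certificate data it actually reads is zero-length. As a rough pre-flight illustration (a hypothetical helper, not part of Spark or fabric8), one can verify that a caCertFile at least looks like non-empty PEM before invoking spark-submit:

```python
import os

def looks_like_pem_cert(path):
    """Hypothetical pre-flight check for a CA cert file: it must exist,
    be non-empty, and contain a PEM certificate header -- otherwise the
    JDK-side parse fails the same way as above ('Empty input')."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        return b"-----BEGIN CERTIFICATE-----" in f.read()
```

If this check passes for the file on the command line yet the submission still fails with "Empty input", the client is likely reading a different file than the one passed, e.g. falling back to .kube/config as the comments above suggest.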
[jira] [Updated] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-28833: - Priority: Minor (was: Major) > Document ALTER VIEW Statement in SQL Reference. > --- > > Key: SPARK-28833 > URL: https://issues.apache.org/jira/browse/SPARK-28833 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Assignee: kevin yu >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-28833. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25573 [https://github.com/apache/spark/pull/25573] > Document ALTER VIEW Statement in SQL Reference. > --- > > Key: SPARK-28833 > URL: https://issues.apache.org/jira/browse/SPARK-28833 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Assignee: kevin yu >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-28833: Assignee: kevin yu > Document ALTER VIEW Statement in SQL Reference. > --- > > Key: SPARK-28833 > URL: https://issues.apache.org/jira/browse/SPARK-28833 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: jobit mathew >Assignee: kevin yu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29904) Parse timestamps in microsecond precision by JSON/CSV datasources
Maxim Gekk created SPARK-29904: -- Summary: Parse timestamps in microsecond precision by JSON/CSV datasources Key: SPARK-29904 URL: https://issues.apache.org/jira/browse/SPARK-29904 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Maxim Gekk Currently, Spark can parse strings with timestamps from JSON/CSV in millisecond precision. Internally, timestamps have microsecond precision. This ticket aims to modify the parsing logic in Spark 2.4 to support microsecond precision. Porting DateFormatter/TimestampFormatter from Spark 3.0-preview is risky, so a lighter solution is needed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
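The precision gap the ticket describes can be illustrated with plain Python (a sketch of the concept, not Spark's parsing code): a format with six fractional digits keeps microseconds, while a millisecond-only parser effectively truncates the last three digits.

```python
from datetime import datetime

# %f parses up to six fractional digits, i.e. microsecond precision.
parsed = datetime.strptime("2019-11-14 12:34:56.123456", "%Y-%m-%d %H:%M:%S.%f")
micros = parsed.microsecond                 # 123456
millis_truncated = (micros // 1000) * 1000  # 123000 -- what a millisecond-only
                                            # parser would retain
```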
[jira] [Commented] (SPARK-29903) Add documentation for recursiveFileLookup
[ https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974577#comment-16974577 ] Nicholas Chammas commented on SPARK-29903: -- cc [~cloud_fan] and [~weichenxu123] > Add documentation for recursiveFileLookup > - > > Key: SPARK-29903 > URL: https://issues.apache.org/jira/browse/SPARK-29903 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively > loading data from a source directory. There is currently no documentation for > this option. > We should document this both for the DataFrame API as well as for SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29903) Add documentation for recursiveFileLookup
Nicholas Chammas created SPARK-29903: Summary: Add documentation for recursiveFileLookup Key: SPARK-29903 URL: https://issues.apache.org/jira/browse/SPARK-29903 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0 Reporter: Nicholas Chammas SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively loading data from a source directory. There is currently no documentation for this option. We should document this both for the DataFrame API as well as for SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
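In the DataFrame API the option is passed as a read option, e.g. `spark.read.format("parquet").option("recursiveFileLookup", "true").load(path)`. The behavior it toggles can be sketched with a stdlib analogy (a hypothetical helper, not Spark code): a flat listing sees only top-level files, while a recursive listing descends into subdirectories.

```python
import pathlib

def list_files(root, recursive=False):
    # Hypothetical analogy to the option's effect: recursive=False sees
    # only top-level files; recursive=True also picks up files nested in
    # subdirectories of the source path.
    root = pathlib.Path(root)
    pattern = "**/*" if recursive else "*"
    return sorted(p.name for p in root.glob(pattern) if p.is_file())
```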
[jira] [Created] (SPARK-29902) Add listener event queue capacity configuration to documentation
shahid created SPARK-29902: -- Summary: Add listener event queue capacity configuration to documentation Key: SPARK-29902 URL: https://issues.apache.org/jira/browse/SPARK-29902 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0 Reporter: shahid Add listener event queue capacity configuration to documentation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)
[ https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974527#comment-16974527 ] John Bauer commented on SPARK-29691: [[SPARK-29691] ensure Param objects are valid in fit, transform|https://github.com/apache/spark/pull/26527] > Estimator fit method fails to copy params (in PySpark) > -- > > Key: SPARK-29691 > URL: https://issues.apache.org/jira/browse/SPARK-29691 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: John Bauer >Priority: Minor > > Estimator `fit` method is supposed to copy a dictionary of params, > overwriting the estimator's previous values, before fitting the model. > However, the parameter values are not updated. This was observed in PySpark, > but may be present in the Java objects, as the PySpark code appears to be > functioning correctly. (The copy method that interacts with Java is > actually implemented in Params.) > For example, this prints > Before: 0.8 > After: 0.8 > but After should be 0.75 > {code:python} > from pyspark.ml.classification import LogisticRegression > # Load training data > training = spark \ > .read \ > .format("libsvm") \ > .load("data/mllib/sample_multiclass_classification_data.txt") > lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) > print("Before:", lr.getOrDefault("elasticNetParam")) > # Fit the model, but with an updated parameter setting: > lrModel = lr.fit(training, params={"elasticNetParam" : 0.75}) > print("After:", lr.getOrDefault("elasticNetParam")) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
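Note that the snippet above passes a plain string key ("elasticNetParam") where PySpark's copy machinery works with Param objects, which may be why the override is silently dropped; the linked PR adds validation. The copy-with-overrides contract itself can be sketched in plain Python (a hypothetical stand-in, not PySpark's actual Params implementation): the fit applies the extra params as overrides on a copy, so the fitted result should see 0.75.

```python
class SketchEstimator:
    """Hypothetical stand-in for pyspark.ml's Estimator/Params pattern;
    not the real implementation."""

    def __init__(self, **params):
        self.params = dict(params)

    def copy(self, extra=None):
        # Extra params override this estimator's values on the copy.
        merged = dict(self.params)
        merged.update(extra or {})
        return SketchEstimator(**merged)

    def fit(self, dataset, params=None):
        # Fit using a copy that has the overrides applied; the original
        # estimator's own params are left untouched.
        fitted = self.copy(params)
        return fitted.params  # stand-in for the fitted model's params

lr = SketchEstimator(maxIter=10, regParam=0.3, elasticNetParam=0.8)
model_params = lr.fit(None, params={"elasticNetParam": 0.75})
# model_params sees 0.75; lr itself still holds 0.8
```

Whether the original estimator should also reflect the override (as the "After:" line in the report expects) is part of what the ticket is sorting out; the sketch only shows the copy semantics.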
[jira] [Resolved] (SPARK-29672) update spark testing framework to use python3
[ https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp resolved SPARK-29672. - Resolution: Fixed > update spark testing framework to use python3 > - > > Key: SPARK-29672 > URL: https://issues.apache.org/jira/browse/SPARK-29672 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] > it's time, at least for 3.0+ to migrate the test execution framework to > python 3.6. > this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. > after 3.0 is cut, we can then add python3.8 and drop python2.7 and pypy2.5.1 > test support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29672) update spark testing framework to use python3
[ https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29672: Description: python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] it's time, at least for 3.0+ to migrate the test execution framework to python 3.6. this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. after 3.0 is cut, we can then add python3.8 and drop python2.7 and pypy2.5.1 test support. was: python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] it's time, at least for 3.0+ to remove python 2.7 test support and migrate the test execution framework to python 3.6. this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. > update spark testing framework to use python3 > - > > Key: SPARK-29672 > URL: https://issues.apache.org/jira/browse/SPARK-29672 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] > it's time, at least for 3.0+ to migrate the test execution framework to > python 3.6. > this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. > after 3.0 is cut, we can then add python3.8 and drop python2.7 and pypy2.5.1 > test support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29672) update spark testing framework to use python3
[ https://issues.apache.org/jira/browse/SPARK-29672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29672: Summary: update spark testing framework to use python3 (was: remove python2 tests and test infra) > update spark testing framework to use python3 > - > > Key: SPARK-29672 > URL: https://issues.apache.org/jira/browse/SPARK-29672 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Shane Knapp >Priority: Major > > python 2.7 is EOL jan 1st 2020: [https://github.com/python/devguide/pull/344] > it's time, at least for 3.0+ to remove python 2.7 test support and migrate > the test execution framework to python 3.6. > this PR ([https://github.com/apache/spark/pull/26330]) does all of the above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29901) Fix broken links in SQL Reference
Huaxin Gao created SPARK-29901: -- Summary: Fix broken links in SQL Reference Key: SPARK-29901 URL: https://issues.apache.org/jira/browse/SPARK-29901 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao Fix the broken links -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974456#comment-16974456 ] Ryan Blue commented on SPARK-29900: --- To be clear, we think this is going to be a breaking change, right? > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 row) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
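The two lookup behaviors the ticket contrasts can be sketched as a two-tier dictionary lookup (illustrative Python only, not Spark's analyzer): behavior 1 consults the temp-view registry before the catalog for unqualified names, behavior 2 goes straight to the catalog, and a qualified name bypasses temp views in both.

```python
# Illustrative registries; names and values are made up for the sketch.
temp_views = {"t": "temp view t"}
catalog = {"t": "table s1.t (persistent)", "s1.t": "table s1.t (persistent)"}

def resolve(name, check_temp_first=True):
    # Behavior 1 (SELECT/INSERT/DESC TABLE): unqualified names hit temp
    # views first.  Behavior 2 (most commands): straight to the catalog.
    if check_temp_first and "." not in name and name in temp_views:
        return temp_views[name]
    return catalog[name]

resolve("t")                          # behavior 1 -> the temp view
resolve("t", check_temp_first=False)  # behavior 2 -> the persistent table
resolve("s1.t")                       # a qualified name disambiguates
```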
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974436#comment-16974436 ] Burak Yavuz commented on SPARK-29900: - I definitely agree the behavior is very confusing here. (For example, you can saveAsTable into a table, while a temp table with the same name exists... Once you query the table, you get the temp table back). Can we post here the proposed behavior? > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974433#comment-16974433 ] Terry Kim commented on SPARK-29900: --- Yes. Thanks [~cloud_fan] > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29899) Can not recursively lookup files in Hive table via SQL
[ https://issues.apache.org/jira/browse/SPARK-29899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29899: --- Description: SPARK-27990 provides a way to recursively load data from a datasource. In SQL, when querying a Hive table, this property is passed via `relation.tableMeta.properties`, but it is now filtered out, so we cannot look up files recursively for a Hive table. (was: SPARK-27990 provide a way to recursively load data from datasource. In SQL, this property passed by the `relation.tableMeta.properties`. But in Parquet file format, it is filtered out. So we can not lookup file recursively for a table.) > Can not recursively lookup files in Hive table via SQL > -- > > Key: SPARK-29899 > URL: https://issues.apache.org/jira/browse/SPARK-29899 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > SPARK-27990 provides a way to recursively load data from a datasource. In SQL, > when querying a Hive table, this property is passed via > `relation.tableMeta.properties`, but it is now filtered out, so we cannot > look up files recursively for a Hive table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
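The failure mode described can be sketched in a few lines (illustrative only: the recursiveFileLookup key name is real, but the whitelist filter below is a hypothetical stand-in for Spark's internal property filtering): when table properties pass through a filter that forwards only recognized keys, the lookup option never reaches the file listing.

```python
# Table properties as they might arrive via relation.tableMeta.properties
# (values and the filter are made up for the sketch).
table_properties = {
    "recursiveFileLookup": "true",
    "serialization.format": "1",
}

# A whitelist-style filter that forwards only known datasource options
# drops recursiveFileLookup on the floor -- the bug the ticket describes.
FORWARDED_KEYS = {"serialization.format"}
options = {k: v for k, v in table_properties.items() if k in FORWARDED_KEYS}
# options no longer contains "recursiveFileLookup"
```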
[jira] [Updated] (SPARK-29899) Can not recursively lookup files in Hive table via SQL
[ https://issues.apache.org/jira/browse/SPARK-29899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29899: --- Summary: Can not recursively lookup files in Hive table via SQL (was: Can not set recursiveFileLookup property in SQL) > Can not recursively lookup files in Hive table via SQL > -- > > Key: SPARK-29899 > URL: https://issues.apache.org/jira/browse/SPARK-29899 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > SPARK-27990 provide a way to recursively load data from datasource. In SQL, > this property passed by the `relation.tableMeta.properties`. But in Parquet > file format, it is filtered out. So we can not lookup file recursively for a > table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29899) Can not set recursiveFileLookup property in SQL
[ https://issues.apache.org/jira/browse/SPARK-29899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-29899: --- Summary: Can not set recursiveFileLookup property in SQL (was: Can not set recursiveFileLookup property in TBLPROPERTIES if file format is Parquet) > Can not set recursiveFileLookup property in SQL > --- > > Key: SPARK-29899 > URL: https://issues.apache.org/jira/browse/SPARK-29899 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > SPARK-27990 provide a way to recursively load data from datasource. In SQL, > this property passed by the `relation.tableMeta.properties`. But in Parquet > file format, it is filtered out. So we can not lookup file recursively for a > table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974352#comment-16974352 ] Wenchen Fan commented on SPARK-29900: - [~imback82] do you want to drive it? also cc [~rdblue] [~brkyvz] [~dongjoon] > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29900: Description: Currently, Spark has 2 different relation resolution behaviors: 1. try to look up the temp view first, then the table/persistent view. 2. try to look up the table/persistent view only. The first behavior is used in SELECT, INSERT and a few commands that support views, like DESC TABLE. The second behavior is used in most commands. It's confusing to have inconsistent relation resolution behaviors, and the benefit is super small. It's only useful when there are a temp view and a table with the same name, but users can easily use a qualified table name to disambiguate. In Postgres, the relation resolution behavior is consistent: {code} cloud0fan=# create schema s1; CREATE SCHEMA cloud0fan=# SET search_path TO s1; SET cloud0fan=# create table s1.t (i int); CREATE TABLE cloud0fan=# insert into s1.t values (1); INSERT 0 1 # access table with qualified name cloud0fan=# select * from s1.t; i --- 1 (1 row) # access table with single name cloud0fan=# select * from t; i --- 1 (1 row) # create a temp view with conflicting name cloud0fan=# create temp view t as select 2 as i; CREATE VIEW # same as spark, temp view has higher priority during resolution cloud0fan=# select * from t; i --- 2 (1 row) # DROP TABLE also resolves temp view first cloud0fan=# drop table t; ERROR: "t" is not a table # DELETE also resolves temp view first cloud0fan=# delete from t where i = 0; ERROR: cannot delete from view "t" {code} was: Currently, Spark has 2 different relation resolution behaviors: 1. try to look up the temp view first, then the table/persistent view. 2. try to look up the table/persistent view only. The first behavior is used in SELECT, INSERT and a few commands that support views, like DESC TABLE. The second behavior is used in most commands. It's confusing to have inconsistent relation resolution behaviors, and the benefit is super small.
It's only useful when there are a temp view and a table with the same name, but users can easily use a qualified table name to disambiguate. In Postgres, the relation resolution behavior is consistent: {code} cloud0fan=# create schema s1; CREATE SCHEMA cloud0fan=# SET search_path TO s1; SET cloud0fan=# create table s1.t (i int); CREATE TABLE cloud0fan=# insert into s1.t values (1); INSERT 0 1 # access table with qualified name cloud0fan=# select * from s1.t; i --- 1 (1 row) # access table with single name cloud0fan=# select * from t; i --- 1 (1 row) # create a temp view with conflicting name cloud0fan=# create temp view t as select 2 as i; CREATE VIEW # same as spark, temp view has higher priority during resolution cloud0fan=# select * from t; i --- 2 (1 row) # DROP TABLE also resolves temp view first cloud0fan=# drop table t; ERROR: "t" is not a table # DELETE also resolves temp view first cloud0fan=# delete from t where i = 0; ERROR: cannot delete from view "t" {code} > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up the temp view first, then the table/persistent view. > 2. try to look up the table/persistent view only. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are a temp view and a table > with the same name, but users can easily use a qualified table name to > disambiguate.
> In Postgres, the relation resolution behavior is consistent: > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 row) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code}
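The two lookup behaviors described in the ticket can be illustrated with a toy resolver. This is only a sketch: `temp_views` and `catalog` are hypothetical stand-ins, not Spark internals.

```python
# Toy sketch of Spark's two relation-resolution behaviors (illustrative only).
temp_views = {"t": "temp view: SELECT 2 AS i"}
catalog = {("s1", "t"): "table s1.t"}

def resolve_view_aware(name, current_db="s1"):
    """Behavior 1 (SELECT/INSERT/DESC TABLE): the temp view shadows the table."""
    if name in temp_views:
        return temp_views[name]
    return catalog.get((current_db, name))

def resolve_table_only(name, current_db="s1"):
    """Behavior 2 (most commands): go straight to the catalog."""
    return catalog.get((current_db, name))

def resolve_qualified(db, name):
    """A qualified name always bypasses temp views, so it disambiguates."""
    return catalog.get((db, name))
```

Here `resolve_view_aware("t")` returns the temp view while `resolve_table_only("t")` returns the table, which is exactly the inconsistency the ticket proposes to remove.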
[jira] [Created] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
Wenchen Fan created SPARK-29900: --- Summary: make relation lookup behavior consistent within Spark SQL Key: SPARK-29900 URL: https://issues.apache.org/jira/browse/SPARK-29900 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Currently, Spark has 2 different relation resolution behaviors: 1. try to look up the temp view first, then the table/persistent view. 2. try to look up the table/persistent view only. The first behavior is used in SELECT, INSERT and a few commands that support views, like DESC TABLE. The second behavior is used in most commands. It's confusing to have inconsistent relation resolution behaviors, and the benefit is super small. It's only useful when there are a temp view and a table with the same name, but users can easily use a qualified table name to disambiguate. In Postgres, the relation resolution behavior is consistent: {code} cloud0fan=# create schema s1; CREATE SCHEMA cloud0fan=# SET search_path TO s1; SET cloud0fan=# create table s1.t (i int); CREATE TABLE cloud0fan=# insert into s1.t values (1); INSERT 0 1 # access table with qualified name cloud0fan=# select * from s1.t; i --- 1 (1 row) # access table with single name cloud0fan=# select * from t; i --- 1 (1 row) # create a temp view with conflicting name cloud0fan=# create temp view t as select 2 as i; CREATE VIEW # same as spark, temp view has higher priority during resolution cloud0fan=# select * from t; i --- 2 (1 row) # DROP TABLE also resolves temp view first cloud0fan=# drop table t; ERROR: "t" is not a table # DELETE also resolves temp view first cloud0fan=# delete from t where i = 0; ERROR: cannot delete from view "t" {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29899) Can not set recursiveFileLookup property in TBLPROPERTIES if file format is Parquet
Lantao Jin created SPARK-29899: -- Summary: Can not set recursiveFileLookup property in TBLPROPERTIES if file format is Parquet Key: SPARK-29899 URL: https://issues.apache.org/jira/browse/SPARK-29899 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Lantao Jin SPARK-27990 provides a way to recursively load data from a datasource. In SQL, this property is passed via `relation.tableMeta.properties`. But the Parquet file format filters it out, so we cannot look up files recursively for a table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
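The effect the ticket describes can be illustrated with plain directory listing: a flat listing misses files in nested directories, while a recursive lookup finds them. This is only a sketch of the concept, not Spark's file-index code.

```python
import os
import tempfile

def list_files(root, recursive=False):
    """Flat listing vs. recursive lookup, mirroring what the
    recursiveFileLookup option enables (illustrative only)."""
    if not recursive:
        return sorted(
            os.path.join(root, f) for f in os.listdir(root)
            if os.path.isfile(os.path.join(root, f))
        )
    found = []
    for dirpath, _dirs, files in os.walk(root):
        found.extend(os.path.join(dirpath, f) for f in files)
    return sorted(found)

# Demo: a nested layout that a flat listing misses.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "nested"))
open(os.path.join(root, "a.parquet"), "w").close()
open(os.path.join(root, "nested", "b.parquet"), "w").close()

flat = list_files(root)
deep = list_files(root, recursive=True)
```

With the property filtered out, a table backed by this layout would behave like `flat` (one file) instead of `deep` (both files).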
[jira] [Updated] (SPARK-29898) Support Avro Custom Logical Types
[ https://issues.apache.org/jira/browse/SPARK-29898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos del Prado Mota updated SPARK-29898: -- Description: Extends the options of the Spark Avro formatter to allow using custom Avro logical types. At the moment only timestamp and decimal logical types are supported in Spark, but Avro supports any conversion you could need. This change keeps the default mappings and allows adding new ones. {{spark}} {{ .read}} {{ .format("avro")}} {{ .option("logicalTypeMapper", "org.example.CustomAvroLogicalCatalystMapper")}} {{ .load()}} All you need to do is register your custom Avro logical type and then implement `AvroLogicalTypeCatalystMapper` was: Extends the options of the Spark Avro formatter to allow using custom Avro logical types. At the moment only timestamp and decimal logical types are supported in Spark, but Avro supports any conversion you could need. This change keeps the default mappings and allows adding new ones. {{spark}} {{ .read}} {{ .format("avro")}} {{ .option("logicalTypeMapper", "org.example.CustomAvroLogicalCatalystMapper")}} {{ .load()}} All you need to do is register your custom Avro logical type and then implement `AvroLogicalTypeCatalystMapper` > Support Avro Custom Logical Types > - > > Key: SPARK-29898 > URL: https://issues.apache.org/jira/browse/SPARK-29898 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Carlos del Prado Mota >Priority: Major > > Extends the options of the Spark Avro formatter to allow using custom Avro > logical types. > At the moment only timestamp and decimal logical types are supported in Spark, > but Avro supports any conversion you could need. This change keeps the > default mappings and allows adding new ones.
> {{spark}} > {{ .read}} > {{ .format("avro")}} > {{ .option("logicalTypeMapper", > "org.example.CustomAvroLogicalCatalystMapper")}} > {{ .load()}} > All you need to do is register your custom Avro logical type and then implement > `AvroLogicalTypeCatalystMapper` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
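The proposed pluggable mapper can be sketched as a conversion registry that keeps built-in mappings and accepts custom ones. Every name below (`register_mapper`, `convert`, the `money-cents` type) is hypothetical, not part of the proposed Spark API.

```python
# Sketch of a pluggable logical-type mapping registry, in the spirit of the
# proposed `logicalTypeMapper` option. All names here are hypothetical.
from datetime import date, timedelta

DEFAULT_MAPPERS = {
    # Built-in style conversion: Avro's date logical type stores
    # days since the Unix epoch.
    "date": lambda days: date(1970, 1, 1) + timedelta(days=days),
}

def register_mapper(logical_type, fn, registry=DEFAULT_MAPPERS):
    """Register a custom conversion while keeping the default mappings."""
    registry[logical_type] = fn

def convert(logical_type, raw, registry=DEFAULT_MAPPERS):
    """Apply the mapper for a logical type; unmapped values pass through."""
    fn = registry.get(logical_type)
    return fn(raw) if fn else raw

# A custom logical type, e.g. money stored as integer cents.
register_mapper("money-cents", lambda cents: cents / 100.0)
```

The registry pattern mirrors the ticket's intent: defaults stay intact, and new conversions are additive.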
[jira] [Created] (SPARK-29898) Support Avro Custom Logical Types
Carlos del Prado Mota created SPARK-29898: - Summary: Support Avro Custom Logical Types Key: SPARK-29898 URL: https://issues.apache.org/jira/browse/SPARK-29898 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Carlos del Prado Mota -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29897) Implicit cast to timestamp is failing
ABHISHEK KUMAR GUPTA created SPARK-29897: Summary: Implicit cast to timestamp is failing Key: SPARK-29897 URL: https://issues.apache.org/jira/browse/SPARK-29897 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA Spark cannot cast the string literal implicitly: jdbc:hive2://10.18.19.208:23040/default> SELECT EXTRACT(DAY FROM NOW() - '2014-08-02 08:10:56'); Error: org.apache.spark.sql.AnalysisException: cannot resolve '(current_timestamp() - CAST('2014-08-02 08:10:56' AS DOUBLE))' due to data type mismatch: differing types in '(current_timestamp() - CAST('2014-08-02 08:10:56' AS DOUBLE))' (timestamp and double).; line 1 pos 24; PostgreSQL and MySQL can handle the same query. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
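The behavior the ticket asks for (coerce the string literal to a timestamp, subtract, then extract the day, as PostgreSQL does) can be sketched as follows. This is illustrative only, not Spark's actual coercion rules.

```python
# Sketch of the desired implicit coercion: a string operand in timestamp
# subtraction is cast to timestamp, not to double (illustrative only).
from datetime import datetime

def minus(left, right):
    """Subtract, implicitly casting any string operand to a timestamp."""
    def coerce(v):
        return datetime.fromisoformat(v) if isinstance(v, str) else v
    return coerce(left) - coerce(right)

def extract_day(interval):
    """EXTRACT(DAY FROM interval): the whole days of the interval."""
    return interval.days

# NOW() is fixed here so the example is deterministic.
delta = minus("2014-08-05 10:00:00", "2014-08-02 08:10:56")
```

With this coercion the query in the report resolves cleanly instead of failing with a `timestamp and double` mismatch.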
[jira] [Commented] (SPARK-20110) Windowed aggregation do not work when the timestamp is a nested field
[ https://issues.apache.org/jira/browse/SPARK-20110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974177#comment-16974177 ] hurelhuyag commented on SPARK-20110: I just faced the same problem now, on Spark version 2.4.4. I don't understand the difference: the two queries do the same thing. If the first one is wrong, then the second one should be wrong too. > Windowed aggregation do not work when the timestamp is a nested field > - > > Key: SPARK-20110 > URL: https://issues.apache.org/jira/browse/SPARK-20110 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Alexis Seigneurin >Priority: Major > Labels: bulk-closed > > I am loading data into a DataFrame with nested fields. I want to perform a > windowed aggregation on the timestamp from a nested field: > {code} > .groupBy(window($"auth.sysEntryTimestamp", "2 minutes")) > {code} > I get the following error: > {quote} > org.apache.spark.sql.AnalysisException: Multiple time window expressions > would result in a cartesian product of rows, therefore they are currently > not supported. > {quote} > This works fine if I first extract the timestamp to a separate column: > {code} > .withColumn("sysEntryTimestamp", $"auth.sysEntryTimestamp") > .groupBy( > window($"sysEntryTimestamp", "2 minutes") > ) > {code} > Please see the whole sample: > - batch: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363210/3769253384867782/latest.html > - Structured Streaming: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363192/3769253384867782/latest.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20110) Windowed aggregation do not work when the timestamp is a nested field
[ https://issues.apache.org/jira/browse/SPARK-20110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974177#comment-16974177 ] hurelhuyag edited comment on SPARK-20110 at 11/14/19 12:09 PM: --- I just faced the same problem, on Spark version 2.4.4. I don't understand the difference: the two queries do the same thing. If the first one is wrong, then the second one should be wrong too. was (Author: hurelhuyag): I just faced the same problem now, on Spark version 2.4.4. I don't understand the difference: the two queries do the same thing. If the first one is wrong, then the second one should be wrong too. > Windowed aggregation do not work when the timestamp is a nested field > - > > Key: SPARK-20110 > URL: https://issues.apache.org/jira/browse/SPARK-20110 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Alexis Seigneurin >Priority: Major > Labels: bulk-closed > > I am loading data into a DataFrame with nested fields. I want to perform a > windowed aggregation on the timestamp from a nested field: > {code} > .groupBy(window($"auth.sysEntryTimestamp", "2 minutes")) > {code} > I get the following error: > {quote} > org.apache.spark.sql.AnalysisException: Multiple time window expressions > would result in a cartesian product of rows, therefore they are currently > not supported.
> {quote} > This works fine if I first extract the timestamp to a separate column: > {code} > .withColumn("sysEntryTimestamp", $"auth.sysEntryTimestamp") > .groupBy( > window($"sysEntryTimestamp", "2 minutes") > ) > {code} > Please see the whole sample: > - batch: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363210/3769253384867782/latest.html > - Structured Streaming: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4683710270868386/4278399007363192/3769253384867782/latest.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
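The workaround in the report pulls the nested timestamp into a top-level column before windowing. A minimal pure-Python sketch of the same tumbling-window grouping (illustrative only; this is not Spark's `window` implementation, and the row shapes are made up):

```python
# Sketch of window($"ts", "2 minutes") applied after extracting the nested
# timestamp, mirroring the workaround above (illustrative only).
from collections import defaultdict
from datetime import datetime

def window_start(ts, minutes=2):
    """Floor a timestamp to the start of its tumbling window.
    Assumes the window size divides 60 minutes evenly."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

rows = [
    {"auth": {"sysEntryTimestamp": datetime(2019, 11, 14, 12, 0, 30)}},
    {"auth": {"sysEntryTimestamp": datetime(2019, 11, 14, 12, 1, 10)}},
    {"auth": {"sysEntryTimestamp": datetime(2019, 11, 14, 12, 2, 5)}},
]

counts = defaultdict(int)
for row in rows:
    # Equivalent of .withColumn("sysEntryTimestamp", $"auth.sysEntryTimestamp")
    ts = row["auth"]["sysEntryTimestamp"]
    counts[window_start(ts)] += 1
```

The first two rows land in the 12:00 window and the third in the 12:02 window, which is the aggregation both Spark queries are trying to express.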
[jira] [Created] (SPARK-29896) Extend typed literals support for all spark native types
Kent Yao created SPARK-29896: Summary: Extend typed literals support for all spark native types Key: SPARK-29896 URL: https://issues.apache.org/jira/browse/SPARK-29896 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Currently, Date, Timestamp, Interval, Binary, and INTEGER typed literals are supported. We should support other native datatypes for this feature.
{code:sql}
-- typed literals
-- boolean
select boolean 'true';
select boolean 'false';
select boolean 't';
select boolean 'f';
select boolean 'yes';
select boolean 'no';
select -boolean 'true';

-- byte
select tinyint '1';
select tinyint '-1';
select tinyint '128';
select byte '1';
select -tinyint '1';

-- short
select smallint '1';
select smallint '-1';
select smallint '32768';
select short '1';
select -smallint '1';

-- long
select long '1';
select bigint '-1';
select -bigint '1';

-- float/double
select float '1';
select -float '-1';
select double '1';
select -double '1';

-- hive string type
select char(10) '12345';
select varchar(10) '12345';

-- binary
select binary '12345';

-- decimal
select decimal '1.01';
select decimal(10, 2) '11.1';
select decimal(2, 0) '11.1';
select decimal(2, 1) '11.1';
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
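A toy parser shows the kind of range and format checks such typed literals imply. This is only a sketch: it is not Spark's literal parsing, and the accepted boolean spellings are assumptions based on the SQL file above.

```python
# Toy parser for typed literals like tinyint '1' or boolean 'yes'
# (illustrative only; not Spark's actual literal parsing).
def parse_typed_literal(type_name, text):
    t = type_name.lower()
    if t == "boolean":
        if text.lower() in {"true", "t", "yes"}:
            return True
        if text.lower() in {"false", "f", "no"}:
            return False
        raise ValueError(f"invalid boolean literal: {text!r}")
    if t in ("tinyint", "byte"):
        v = int(text)
        if not -128 <= v <= 127:  # tinyint '128' must fail
            raise ValueError(f"tinyint out of range: {v}")
        return v
    if t in ("smallint", "short"):
        v = int(text)
        if not -32768 <= v <= 32767:  # smallint '32768' must fail
            raise ValueError(f"smallint out of range: {v}")
        return v
    if t in ("float", "double"):
        return float(text)
    raise ValueError(f"unsupported typed literal: {type_name}")
```

Literals such as `tinyint '128'` and `smallint '32768'` in the ticket's test file are there precisely to exercise these out-of-range errors.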
[jira] [Commented] (SPARK-21040) On executor/worker decommission consider speculatively re-launching current tasks
[ https://issues.apache.org/jira/browse/SPARK-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974162#comment-16974162 ] Prakhar Jain commented on SPARK-21040: -- Hi [~holden], At Microsoft, we are also facing the same issues while adding support for low-priority VMs, and we are working along similar lines. We have considered the following options: Option 1) Whenever an executor goes into the decommissioning state, consider all the tasks that are running on that executor for speculation (without worrying about "spark.speculation.quantile" or "spark.speculation.multiplier"). Option 2) Whenever an executor goes into the decommissioning state, check the following for each task running on that executor - Check if X% of tasks have finished in the corresponding stage and identify the median time - if (MedianTime - RunTimeOfTaskInConsideration) > cloud_threshold then consider the task for speculation. cloud_threshold can be set as a configuration parameter (e.g. 120 seconds for AWS spot instances). What are your thoughts on the same? > On executor/worker decommission consider speculatively re-launching current > tasks > - > > Key: SPARK-21040 > URL: https://issues.apache.org/jira/browse/SPARK-21040 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > > If speculative execution is enabled we may wish to consider decommissioning > of a worker as a weight for speculative execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
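Option 2 from the comment above can be sketched as a small decision function. The function name, parameters, and defaults (`quantile=0.75`, `cloud_threshold=120`) are hypothetical choices for illustration, not anything merged into Spark.

```python
# Sketch of Option 2: on decommission, speculate a running task only if the
# median runtime of finished tasks leaves it enough headroom versus a
# configurable cloud threshold. All names/defaults are hypothetical.
import statistics

def should_speculate(finished_runtimes, task_runtime, total_tasks,
                     quantile=0.75, cloud_threshold=120.0):
    """Decide whether a task on a decommissioning executor should be
    speculatively re-launched elsewhere (all times in seconds)."""
    # Require X% of the stage's tasks to have finished first.
    if len(finished_runtimes) < quantile * total_tasks:
        return False
    median = statistics.median(finished_runtimes)
    # The comment's rule: (MedianTime - RunTimeOfTaskInConsideration)
    # must exceed cloud_threshold for speculation to pay off.
    return (median - task_runtime) > cloud_threshold
```

A task that has already run close to the median is left alone, since relaunching it would likely cost more than letting the decommissioning executor finish it.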
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974143#comment-16974143 ] huangtianhua commented on SPARK-29106: -- [~shaneknapp], the VM is ready; I have built and tested in /home/jenkins/spark. Because the image of the old ARM testing instance is too large, we can't create the new instance from that image, so we copied the contents of /home/jenkins/ into the new instance. Also, because of the network performance, we cache local sources such as "hive-ivy" in /home/jenkins/hive-ivy-cache; please export the environment variable {color:#de350b}SPARK_VERSIONS_SUITE_IVY_PATH=/home/jenkins/hive-ivy-cache/{color} before running the Maven tests. I will send the detailed info of the VM to your email later. Please add it as a worker of the amplab Jenkins and try to build the two jobs as we did before; don't hesitate to contact us if you have any questions, thanks very much. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt > > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic ARM test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), > and the other is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the ARM test with amplab > jenkins.
> About the k8s test on ARM, we have tested it, see > [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it > later. > We also plan to test other stable branches, and we can integrate them into > amplab when they are ready. > We have offered an ARM instance and sent the info to shane knapp; thanks > shane for adding the first ARM job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like a > 'property'/'profile' to choose the correct jar package on the arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases a new arm-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it when running the ARM tests for spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29894) Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab
[ https://issues.apache.org/jira/browse/SPARK-29894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-29894: Attachment: snippet_plan_graph_before_patch.png > Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab > --- > > Key: SPARK-29894 > URL: https://issues.apache.org/jira/browse/SPARK-29894 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png, > snippet_plan_graph_before_patch.png > > > The Web UI SQL Tab provides information on the executed SQL using plan graphs > and SQL execution plans. Both provide useful information. Physical execution > plans report the Codegen Stage Id. It is useful to have Codegen Stage Id also > reported in the plan graphs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29894) Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab
[ https://issues.apache.org/jira/browse/SPARK-29894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-29894: Attachment: snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png > Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab > --- > > Key: SPARK-29894 > URL: https://issues.apache.org/jira/browse/SPARK-29894 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Attachments: snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png > > > The Web UI SQL Tab provides information on the executed SQL using plan graphs > and SQL execution plans. Both provide useful information. Physical execution > plans report the Codegen Stage Id. It is useful to have Codegen Stage Id also > reported in the plan graphs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29894) Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab
Luca Canali created SPARK-29894: --- Summary: Add Codegen Stage Id to Spark plan graphs in Web UI SQL Tab Key: SPARK-29894 URL: https://issues.apache.org/jira/browse/SPARK-29894 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.0.0 Reporter: Luca Canali The Web UI SQL Tab provides information on the executed SQL using plan graphs and SQL execution plans. Both provide useful information. Physical execution plans report the Codegen Stage Id. It is useful to have Codegen Stage Id also reported in the plan graphs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29893) Improve the local reader performance by changing the task number from 1 to multi
Ke Jia created SPARK-29893: -- Summary: Improve the local reader performance by changing the task number from 1 to multi Key: SPARK-29893 URL: https://issues.apache.org/jira/browse/SPARK-29893 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ke Jia Currently the local reader reads all the partitions of the map stage using only 1 task, which may cause performance degradation. This PR will improve performance by using multiple tasks instead of one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29892) Add built-in Array Functions: array_cat
[ https://issues.apache.org/jira/browse/SPARK-29892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-29892: --- Description: |{{array_cat}}{{(}}{{anyarray}}{{, }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| Other DBs: [https://phoenix.apache.org/language/functions.html#array_cat] was:|{{array_cat}}{{(}}{{anyarray}}{{, }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| > Add built-in Array Functions: array_cat > --- > > Key: SPARK-29892 > URL: https://issues.apache.org/jira/browse/SPARK-29892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_cat}}{{(}}{{anyarray}}{{, > }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two > arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| > Other DBs: > [https://phoenix.apache.org/language/functions.html#array_cat] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29892) Add built-in Array Functions: array_cat
[ https://issues.apache.org/jira/browse/SPARK-29892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974089#comment-16974089 ] jiaan.geng commented on SPARK-29892: I'm working on this. > Add built-in Array Functions: array_cat > --- > > Key: SPARK-29892 > URL: https://issues.apache.org/jira/browse/SPARK-29892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_cat}}{{(}}{{anyarray}}{{, > }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two > arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29892) Add built-in Array Functions: array_cat
jiaan.geng created SPARK-29892: -- Summary: Add built-in Array Functions: array_cat Key: SPARK-29892 URL: https://issues.apache.org/jira/browse/SPARK-29892 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: jiaan.geng |{{array_cat}}{{(}}{{anyarray}}{{, }}{{anyarray}}{{)}}|{{anyarray}}|concatenate two arrays|{{array_cat(ARRAY[1,2,3], ARRAY[4,5])}}|{{{1,2,3,4,5}}}| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
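The semantics requested above amount to plain array concatenation, as in the PostgreSQL/Phoenix `array_cat` the ticket references. A minimal plain-Python sketch of those semantics (not the proposed Spark implementation; note that Spark already exposes equivalent behavior for array columns through its built-in `concat` function, so the ticket is effectively about the PostgreSQL-compatible name):

```python
# Plain-Python sketch of array_cat semantics: concatenate two arrays
# into one, preserving element order.
def array_cat(a, b):
    return a + b

print(array_cat([1, 2, 3], [4, 5]))  # [1, 2, 3, 4, 5]
```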
[jira] [Comment Edited] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974087#comment-16974087 ] sandeshyapuram edited comment on SPARK-29890 at 11/14/19 9:36 AM: -- I've raised it as a bug because I feel fill.na(0) needs to fill 0 regardless of duplicate column names. [~cloud_fan] Thoughts was (Author: sandeshyapuram): I've raised it as a bug because I feel fill.na(0) needs to fill 0 regardless of duplicate column names. > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974087#comment-16974087 ] sandeshyapuram commented on SPARK-29890: I've raised it as a bug because I feel fill.na(0) needs to fill 0 regardless of duplicate column names. > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.3 >Reporter: sandeshyapuram >Priority: Major > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29891) Add built-in Array Functions: array_length
[ https://issues.apache.org/jira/browse/SPARK-29891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974086#comment-16974086 ] jiaan.geng commented on SPARK-29891: I'm working on this. > Add built-in Array Functions: array_length > -- > > Key: SPARK-29891 > URL: https://issues.apache.org/jira/browse/SPARK-29891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_length}}{{(}}{{anyarray}}{{, }}{{int}}{{)}}|{{int}}|returns the > length of the requested array dimension|{{array_length(array[1,2,3], > 1)}}|{{3}}| > | | | | | | > Other DBs: > [https://phoenix.apache.org/language/functions.html#array_length] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29891) Add built-in Array Functions: array_length
[ https://issues.apache.org/jira/browse/SPARK-29891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-29891: --- Description: |{{array_length}}{{(}}{{anyarray}}{{, }}{{int}}{{)}}|{{int}}|returns the length of the requested array dimension|{{array_length(array[1,2,3], 1)}}|{{3}}| | | | | | | Other DBs: [https://phoenix.apache.org/language/functions.html#array_length] was: |{{array_length}}{{(}}{{anyarray}}{{, }}{{int}}{{)}}|{{int}}|returns the length of the requested array dimension|{{array_length(array[1,2,3], 1)}}|{{3}}| | | | | | | > Add built-in Array Functions: array_length > -- > > Key: SPARK-29891 > URL: https://issues.apache.org/jira/browse/SPARK-29891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > |{{array_length}}{{(}}{{anyarray}}{{, }}{{int}}{{)}}|{{int}}|returns the > length of the requested array dimension|{{array_length(array[1,2,3], > 1)}}|{{3}}| > | | | | | | > Other DBs: > [https://phoenix.apache.org/language/functions.html#array_length] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
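The `array_length(anyarray, int)` signature above returns the number of elements along the requested dimension. A plain-Python sketch of those semantics, modeling a multidimensional array as nested lists (illustrative only, not the proposed Spark implementation):

```python
# Plain-Python sketch of array_length semantics: size of the requested
# dimension, where dimension 1 is the outermost level of nesting.
def array_length(arr, dim):
    for _ in range(dim - 1):
        arr = arr[0]  # descend one nesting level per extra dimension
    return len(arr)

print(array_length([1, 2, 3], 1))         # 3, as in the ticket's example
print(array_length([[1, 2], [3, 4]], 2))  # 2
```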