[jira] [Created] (SPARK-32585) Support scala enumeration in ScalaReflection
ulysses you created SPARK-32585: --- Summary: Support scala enumeration in ScalaReflection Key: SPARK-32585 URL: https://issues.apache.org/jira/browse/SPARK-32585 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: ulysses you -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build
[ https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175237#comment-17175237 ] Apache Spark commented on SPARK-32584: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29402 > Exclude _images and _sources that are generated by Sphinx in Jekyll build > - > > Key: SPARK-32584 > URL: https://issues.apache.org/jira/browse/SPARK-32584 > Project: Spark > Issue Type: Documentation > Components: docs >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png > > > For the _images directory: after SPARK-31851, we added some images to use > within the pages built by Sphinx. It copies the images into the _images > directory. Later, when Jekyll builds, the underscore directories are ignored > by default, which ends up with a missing image in the main doc. I attached the > image to the JIRA. > For the _sources directory, see > https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links.
[jira] [Assigned] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build
[ https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32584: Assignee: Apache Spark > Exclude _images and _sources that are generated by Sphinx in Jekyll build > - > > Key: SPARK-32584 > URL: https://issues.apache.org/jira/browse/SPARK-32584 > Project: Spark > Issue Type: Documentation > Components: docs >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png > > > For the _images directory: after SPARK-31851, we added some images to use > within the pages built by Sphinx. It copies the images into the _images > directory. Later, when Jekyll builds, the underscore directories are ignored > by default, which ends up with a missing image in the main doc. I attached the > image to the JIRA. > For the _sources directory, see > https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links.
[jira] [Assigned] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build
[ https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32584: Assignee: (was: Apache Spark) > Exclude _images and _sources that are generated by Sphinx in Jekyll build > - > > Key: SPARK-32584 > URL: https://issues.apache.org/jira/browse/SPARK-32584 > Project: Spark > Issue Type: Documentation > Components: docs >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png > > > For the _images directory: after SPARK-31851, we added some images to use > within the pages built by Sphinx. It copies the images into the _images > directory. Later, when Jekyll builds, the underscore directories are ignored > by default, which ends up with a missing image in the main doc. I attached the > image to the JIRA. > For the _sources directory, see > https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links.
[jira] [Updated] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build
[ https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32584: - Description: For the _images directory: after SPARK-31851, we added some images to use within the pages built by Sphinx. It copies the images into the _images directory. Later, when Jekyll builds, the underscore directories are ignored by default, which ends up with a missing image in the main doc. I attached the image to the JIRA. For the _sources directory, see https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links. was: After SPARK-31851, we added some images to use within the pages built by Sphinx. It copies the images into the _images directory. Later, when Jekyll builds, the underscore directories are ignored by default, which ends up with a missing image in the main doc. I attached the image to the JIRA. > Exclude _images and _sources that are generated by Sphinx in Jekyll build > - > > Key: SPARK-32584 > URL: https://issues.apache.org/jira/browse/SPARK-32584 > Project: Spark > Issue Type: Documentation > Components: docs >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png > > > For the _images directory: after SPARK-31851, we added some images to use > within the pages built by Sphinx. It copies the images into the _images > directory. Later, when Jekyll builds, the underscore directories are ignored > by default, which ends up with a missing image in the main doc. I attached the > image to the JIRA. > For the _sources directory, see > https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links.
[jira] [Updated] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build
[ https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32584: - Attachment: Screen Shot 2020-08-11 at 1.52.46 PM.png > Exclude _images and _sources that are generated by Sphinx in Jekyll build > - > > Key: SPARK-32584 > URL: https://issues.apache.org/jira/browse/SPARK-32584 > Project: Spark > Issue Type: Documentation > Components: docs >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png > > > After SPARK-31851, we added some images to use within the pages built by > Sphinx. It copies the images into the _images directory. Later, when Jekyll > builds, the underscore directories are ignored by default, which ends up with > a missing image in the main doc. > I attached the image to the JIRA.
[jira] [Created] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build
Hyukjin Kwon created SPARK-32584: Summary: Exclude _images and _sources that are generated by Sphinx in Jekyll build Key: SPARK-32584 URL: https://issues.apache.org/jira/browse/SPARK-32584 Project: Spark Issue Type: Documentation Components: docs Affects Versions: 3.1.0 Reporter: Hyukjin Kwon After SPARK-31851, we added some images to use within the pages built by Sphinx. It copies the images into the _images directory. Later, when Jekyll builds, the underscore directories are ignored by default, which ends up with a missing image in the main doc. I attached the image to the JIRA.
[jira] [Commented] (SPARK-32580) Issue accessing a column values after 'explode' function
[ https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175219#comment-17175219 ] Ayrat Sadreev commented on SPARK-32580: --- If you select values from one of the arrays, it works:
{code:scala}
scala> val df5 = df3.select(df3.col("value.sample"))
df5: org.apache.spark.sql.DataFrame = [sample: decimal(1,1)]

scala> df5.show()
+------+
|sample|
+------+
|   0.1|
+------+

scala> val df6 = df3.select(df3.col("item.item_id"))
df6: org.apache.spark.sql.DataFrame = [item_id: int]

scala> df6.show()
+-------+
|item_id|
+-------+
|      1|
+-------+
{code}
> Issue accessing a column values after 'explode' function > > Key: SPARK-32580 > URL: https://issues.apache.org/jira/browse/SPARK-32580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ayrat Sadreev >Priority: Major > Attachments: ExplodeTest.java, data.json > >
> An exception occurs when trying to flatten doubly nested arrays.
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- item_id: string (nullable = true)
>  |    |    |-- timestamp: string (nullable = true)
>  |    |    |-- values: array (nullable = true)
>  |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
>
> import java.util.concurrent.TimeoutException;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
>
> public class ExplodeTest {
>    public static void main(String[] args) throws TimeoutException {
>       SparkConf conf = new SparkConf()
>             .setAppName("SimpleApp")
>             .set("spark.scheduler.mode", "FAIR")
>             .set("spark.master", "local[1]")
>             .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>       SparkSession spark = SparkSession.builder()
>             .config(conf)
>             .getOrCreate();
>       Dataset<Row> d0 = spark
>             .read()
>             .format("json")
>             .option("multiLine", "true")
>             .schema(getSchema())
>             .load("src/test/resources/explode/data.json");
>       d0.printSchema();
>       d0 = d0.withColumn("item", functions.explode(d0.col("data")));
>       d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
>       d0.printSchema();
>       d0 = d0.select(
>             d0.col("item.item_id"),
>             d0.col("item.timestamp"),
>             d0.col("value.sample")
>       );
>       d0.printSchema();
>       d0.show(); // Fails
>       spark.stop();
>    }
>
>    private static StructType getSchema() {
>       StructField[] level2Fields = {
>             DataTypes.createStructField("sample", DataTypes.DoubleType, false),
>       };
>       StructField[] level1Fields = {
>             DataTypes.createStructField("item_id", DataTypes.StringType, false),
>             DataTypes.createStructField("timestamp", DataTypes.StringType, false),
>             DataTypes.createStructField("values", DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
>       };
>       StructField[] fields = {
>             DataTypes.createStructField("data", DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
>       };
>       return DataTypes.createStructType(fields);
>    }
> }
> {code}
> The data file
> {code:json}
> {
>   "data": [
>     {
>       "item_id": "item_1",
>       "timestamp": "2020-07-01 12:34:89",
>       "values": [
>         { "sample": 1.1 },
>         { "sample": 1.2 }
>       ]
>     },
>     {
>       "item_id": "item_2",
>       "timestamp": "2020-07-02 12:34:89",
>       "values": [
>         { "sample": 2.2 }
>       ]
>     }
>   ]
> }
> {code}
> The Dataset.show() method fails with an exception
> {code:none}
> Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in [_gen_alias_28#28,_gen_alias_29#29]
>   at scala.sys.package$.error(package.scala:30)
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
>   at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
[jira] [Resolved] (SPARK-32583) PySpark Structured Streaming Testing Support
[ https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32583. -- Resolution: Invalid > PySpark Structured Streaming Testing Support > > > Key: SPARK-32583 > URL: https://issues.apache.org/jira/browse/SPARK-32583 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > > Hello, > I have investigated a lot but couldn't get any help or resource on how to > +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ > (ingesting from Kafka topics to S3) and how to build Continuous Integration > (CI)/Continuous Deployment (CD). > 1. Is it possible to test (unit test, integration test) pyspark structured > stream? > 2. How to build Continuous Integration (CI)/Continuous Deployment (CD)?
[jira] [Reopened] (SPARK-32583) PySpark Structured Streaming Testing Support
[ https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-32583: -- > PySpark Structured Streaming Testing Support > > > Key: SPARK-32583 > URL: https://issues.apache.org/jira/browse/SPARK-32583 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > > Hello, > I have investigated a lot but couldn't get any help or resource on how to > +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ > (ingesting from Kafka topics to S3) and how to build Continuous Integration > (CI)/Continuous Deployment (CD). > 1. Is it possible to test (unit test, integration test) pyspark structured > stream? > 2. How to build Continuous Integration (CI)/Continuous Deployment (CD)?
[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support
[ https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175212#comment-17175212 ] Hyukjin Kwon commented on SPARK-32583: -- [~FelixKJose], let's interact on the mailing list or Stack Overflow before filing it as an issue. I am sure you could get a better answer there. This definitely does not look like an issue in Spark. You can take a look at Spark's main codebase to see how the tests are written. > PySpark Structured Streaming Testing Support > > > Key: SPARK-32583 > URL: https://issues.apache.org/jira/browse/SPARK-32583 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > > Hello, > I have investigated a lot but couldn't get any help or resource on how to > +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ > (ingesting from Kafka topics to S3) and how to build Continuous Integration > (CI)/Continuous Deployment (CD). > 1. Is it possible to test (unit test, integration test) pyspark structured > stream? > 2. How to build Continuous Integration (CI)/Continuous Deployment (CD)?
[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support
[ https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175205#comment-17175205 ] Rohit Mishra commented on SPARK-32583: -- cc [~hyukjin.kwon]. > PySpark Structured Streaming Testing Support > > > Key: SPARK-32583 > URL: https://issues.apache.org/jira/browse/SPARK-32583 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > > Hello, > I have investigated a lot but couldn't get any help or resource on how to > +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ > (ingesting from Kafka topics to S3) and how to build Continuous Integration > (CI)/Continuous Deployment (CD). > 1. Is it possible to test (unit test, integration test) pyspark structured > stream? > 2. How to build Continuous Integration (CI)/Continuous Deployment (CD)?
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
[ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175203#comment-17175203 ] Lantao Jin commented on SPARK-32582: Maybe we could offer a new interface that breaks out after one iteration when mergeSchema is false. I am not sure.
{code}
def inferSchema(
    sparkSession: SparkSession,
    options: Map[String, String],
    f: (FileIndex) => Seq[FileStatus]): Option[StructType]
{code}
Do you already have a fix? A PR is welcome. > Spark SQL Infer Schema Performance > -- > > Key: SPARK-32582 > URL: https://issues.apache.org/jira/browse/SPARK-32582 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Jarred Li >Priority: Major > > When infer schema is enabled, it tries to list all the files in the table, > however only one of the files is used to read the schema information. > Performance suffers because all the files in the table are listed when the > number of partitions is large. > > See the code in > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88], > where all the files in the table are input, but only one file's schema > is used to infer the schema.
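The break-out idea suggested above can be sketched in plain Scala, without Spark on the classpath. The names below (`Schema` as a stand-in for `StructType`, the `readSchema` parameter) are hypothetical illustrations, not Spark's real API: when `mergeSchema` is false we stop at the first file whose schema is readable; otherwise we fold a merge over every readable file.

```scala
// Sketch of the proposed short-circuit, with hypothetical stand-ins for
// Spark's FileStatus/StructType (assumptions for illustration only).
object InferSchemaSketch {
  type Schema = Set[String] // stand-in for StructType: a set of field names

  // readSchema returns None for unreadable/corrupt files, hence Option.
  def inferSchema(files: Seq[String],
                  readSchema: String => Option[Schema],
                  mergeSchema: Boolean): Option[Schema] = {
    if (!mergeSchema) {
      // Break out of the iteration at the first readable schema.
      files.iterator.map(readSchema).collectFirst { case Some(s) => s }
    } else {
      // Merge (union of fields) across every readable file.
      val schemas = files.flatMap(readSchema)
      if (schemas.isEmpty) None else Some(schemas.reduce(_ union _))
    }
  }
}
```

With `mergeSchema = false` the iterator is consumed lazily, so files after the first readable one are never touched, which is the behavior the proposed interface would expose.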
[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support
[ https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175202#comment-17175202 ] Felix Kizhakkel Jose commented on SPARK-32583: -- [~rohitmishr1484] It would have been nice if you could have pointed me to a solution. > PySpark Structured Streaming Testing Support > > > Key: SPARK-32583 > URL: https://issues.apache.org/jira/browse/SPARK-32583 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > > Hello, > I have investigated a lot but couldn't get any help or resource on how to > +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ > (ingesting from Kafka topics to S3) and how to build Continuous Integration > (CI)/Continuous Deployment (CD). > 1. Is it possible to test (unit test, integration test) pyspark structured > stream? > 2. How to build Continuous Integration (CI)/Continuous Deployment (CD)?
[jira] [Resolved] (SPARK-32583) PySpark Structured Streaming Testing Support
[ https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra resolved SPARK-32583. -- Resolution: Not A Problem > PySpark Structured Streaming Testing Support > > > Key: SPARK-32583 > URL: https://issues.apache.org/jira/browse/SPARK-32583 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > > Hello, > I have investigated a lot but couldn't get any help or resource on how to > +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ > (ingesting from Kafka topics to S3) and how to build Continuous Integration > (CI)/Continuous Deployment (CD). > 1. Is it possible to test (unit test, integration test) pyspark structured > stream? > 2. How to build Continuous Integration (CI)/Continuous Deployment (CD)?
[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support
[ https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175200#comment-17175200 ] Rohit Mishra commented on SPARK-32583: -- [~FelixKJose], questions are usually answered on either Stack Overflow or the user mailing list. Please see [https://spark.apache.org/community.html] for more info. This task will be marked as resolved here for now, but you can reopen it if you want. Thanks. > PySpark Structured Streaming Testing Support > > > Key: SPARK-32583 > URL: https://issues.apache.org/jira/browse/SPARK-32583 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > > Hello, > I have investigated a lot but couldn't get any help or resource on how to > +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ > (ingesting from Kafka topics to S3) and how to build Continuous Integration > (CI)/Continuous Deployment (CD). > 1. Is it possible to test (unit test, integration test) pyspark structured > stream? > 2. How to build Continuous Integration (CI)/Continuous Deployment (CD)?
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
[ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175197#comment-17175197 ] Lantao Jin commented on SPARK-32582: I see. The implementation of the {{inferSchema}} method depends on the underlying file format. Even for Orc, we still need all the files, since the given Orc files can have different schemas and we want to get a merged schema. > Spark SQL Infer Schema Performance > -- > > Key: SPARK-32582 > URL: https://issues.apache.org/jira/browse/SPARK-32582 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Jarred Li >Priority: Major > > When infer schema is enabled, it tries to list all the files in the table, > however only one of the files is used to read the schema information. > Performance suffers because all the files in the table are listed when the > number of partitions is large. > > See the code in > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88], > where all the files in the table are input, but only one file's schema > is used to infer the schema.
[jira] [Assigned] (SPARK-32216) Remove redundant ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32216: --- Assignee: Allison Wang > Remove redundant ProjectExec > > > Key: SPARK-32216 > URL: https://issues.apache.org/jira/browse/SPARK-32216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.1.0 > >
> Currently, a Spark executed plan can have redundant `ProjectExec` nodes. For example:
> After Filter:
> {code:java}
> == Physical Plan ==
> *(1) Project [a#14L, b#15L, c#16, key#17]
> +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
>    +- *(1) ColumnarToRow
>       +- FileScan parquet [a#14L,b#15L,c#16,key#17]
> {code}
> The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is exactly the same as the filter's output.
> Before Aggregate:
> {code:java}
> == Physical Plan ==
> *(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L])
> +- Exchange hashpartitioning(key#17, 5), true, [id=#77]
>    +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
>       +- *(1) Project [key#17, a#14L, b#15L]
>          +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
>             +- *(1) ColumnarToRow
>                +- FileScan parquet [a#14L,b#15L,key#17]
> {code}
> The `Project [key#17, a#14L, b#15L]` is redundant because a hash aggregate doesn't require its child plan's output to be in a specific order.
>
> In general, a project is redundant when:
> # It has the same output attributes and order as its child's output, when ordering of these attributes is required.
> # It has the same output attributes as its child's output, when attribute output ordering is not required.
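The two redundancy criteria in the quoted description can be sketched as a plain-Scala predicate over attribute lists. This is a simplification for illustration: attributes are modeled as strings, whereas Spark's actual rule works on `Attribute` objects and also handles aliasing; the object and method names here are hypothetical.

```scala
// Sketch of the "redundant project" test described above (simplified:
// attributes as strings; Spark's real rule also considers aliases).
object RedundantProject {
  def isRedundant(projectOutput: Seq[String],
                  childOutput: Seq[String],
                  orderingRequired: Boolean): Boolean =
    if (orderingRequired)
      projectOutput == childOutput        // criterion 1: same attributes, same order
    else
      projectOutput.toSet == childOutput.toSet // criterion 2: same attributes, any order
}
```

For instance, `Project [key, a, b]` over a child producing `[a, b, key]` is redundant under a hash aggregate (ordering not required), but not under an operator that requires its input in that exact attribute order.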
[jira] [Resolved] (SPARK-32216) Remove redundant ProjectExec
[ https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32216. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29031 [https://github.com/apache/spark/pull/29031] > Remove redundant ProjectExec > > > Key: SPARK-32216 > URL: https://issues.apache.org/jira/browse/SPARK-32216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Allison Wang >Priority: Major > Fix For: 3.1.0 > >
> Currently, a Spark executed plan can have redundant `ProjectExec` nodes. For example:
> After Filter:
> {code:java}
> == Physical Plan ==
> *(1) Project [a#14L, b#15L, c#16, key#17]
> +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
>    +- *(1) ColumnarToRow
>       +- FileScan parquet [a#14L,b#15L,c#16,key#17]
> {code}
> The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is exactly the same as the filter's output.
> Before Aggregate:
> {code:java}
> == Physical Plan ==
> *(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L])
> +- Exchange hashpartitioning(key#17, 5), true, [id=#77]
>    +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
>       +- *(1) Project [key#17, a#14L, b#15L]
>          +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
>             +- *(1) ColumnarToRow
>                +- FileScan parquet [a#14L,b#15L,key#17]
> {code}
> The `Project [key#17, a#14L, b#15L]` is redundant because a hash aggregate doesn't require its child plan's output to be in a specific order.
>
> In general, a project is redundant when:
> # It has the same output attributes and order as its child's output, when ordering of these attributes is required.
> # It has the same output attributes as its child's output, when attribute output ordering is not required.
[jira] [Assigned] (SPARK-32400) Test coverage of HiveScripTransformationExec
[ https://issues.apache.org/jira/browse/SPARK-32400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32400: Assignee: (was: Apache Spark) > Test coverage of HiveScripTransformationExec > > > Key: SPARK-32400 > URL: https://issues.apache.org/jira/browse/SPARK-32400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major >
[jira] [Assigned] (SPARK-32400) Test coverage of HiveScripTransformationExec
[ https://issues.apache.org/jira/browse/SPARK-32400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32400: Assignee: Apache Spark > Test coverage of HiveScripTransformationExec > > > Key: SPARK-32400 > URL: https://issues.apache.org/jira/browse/SPARK-32400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-32400) Test coverage of HiveScripTransformationExec
[ https://issues.apache.org/jira/browse/SPARK-32400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175167#comment-17175167 ] Apache Spark commented on SPARK-32400: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/29401 > Test coverage of HiveScripTransformationExec > > > Key: SPARK-32400 > URL: https://issues.apache.org/jira/browse/SPARK-32400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major >
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
[ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175165#comment-17175165 ] Jarred Li commented on SPARK-32582: --- The performance issue I mentioned here is not reading the file but "LIST"-ing the files. For example, if one table has 1000 partitions, the files in those 1000 partitions are listed first. However, only one file is read for schema inference. The "LIST" operation is time-consuming, especially for an object store such as S3. See the list-files code: [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#300] > Spark SQL Infer Schema Performance > -- > > Key: SPARK-32582 > URL: https://issues.apache.org/jira/browse/SPARK-32582 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Jarred Li >Priority: Major > > When infer schema is enabled, it tries to list all the files in the table, > however only one of the files is used to read the schema information. > Performance suffers because all the files in the table are listed when the > number of partitions is large. > > See the code in > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88], > where all the files in the table are input, but only one file's schema > is used to infer the schema.
[jira] [Updated] (SPARK-32543) Remove arrow::as_tibble usage in SparkR
[ https://issues.apache.org/jira/browse/SPARK-32543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32543: - Fix Version/s: 3.0.1 > Remove arrow::as_tibble usage in SparkR > --- > > Key: SPARK-32543 > URL: https://issues.apache.org/jira/browse/SPARK-32543 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 3.0.1, 3.1.0 > > > We increased the minimal version of Arrow R version to 1.0.0 at SPARK-32452, > and Arrow R 0.14 dropped {{as_tibble}}. We can remove the usage in SparkR: > {code} > pkg/R/DataFrame.R: # Arrow drops `as_tibble` since 0.14.0, > see ARROW-5190. > pkg/R/DataFrame.R: if (exists("as_tibble", envir = > asNamespace("arrow"))) { > pkg/R/DataFrame.R: > as.data.frame(arrow::as_tibble(arrowTable), stringsAsFactors = > stringsAsFactors) > pkg/R/deserialize.R:# Arrow drops `as_tibble` since 0.14.0, see > ARROW-5190. > pkg/R/deserialize.R:useAsTibble <- exists("as_tibble", envir = > asNamespace("arrow")) > pkg/R/deserialize.R: as_tibble <- get("as_tibble", envir = > asNamespace("arrow")) > pkg/R/deserialize.R: lapply(batches, function(batch) > as.data.frame(as_tibble(batch))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance
[ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175143#comment-17175143 ] Lantao Jin commented on SPARK-32582: {code} files.toIterator.map(file => readSchema(file.getPath, conf, ignoreCorruptFiles)).collectFirst {code} {{collectFirst()}} breaks out as soon as its iterator finds a matching value. {code} def collectFirst[B](pf: PartialFunction[A, B]): Option[B] = { for (x <- self.toIterator) { // make sure to use an iterator or `seq` if (pf isDefinedAt x) return Some(pf(x)) } None } {code} So in most cases, it reads only one file. > Spark SQL Infer Schema Performance > -- > > Key: SPARK-32582 > URL: https://issues.apache.org/jira/browse/SPARK-32582 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Jarred Li >Priority: Major > > When infer schema is enabled, it tries to list all the files in the table, > however only one of the files is used to read schema information. The > performance is impacted because all the files in the table are listed when the > number of partitions is large. > > See the code in > "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88]", > all the files in the table are input, however only one file's schema > is used to infer the schema. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
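The early-exit behavior Lantao Jin points out can be sketched outside Spark. The following Python sketch (the file names and the read_schema stand-in are hypothetical, not Spark APIs) mirrors {{collectFirst}} over an iterator: the listing of 1000 files is materialized eagerly — the expensive "LIST" step Jarred describes — while only the first file is actually read.

```python
def collect_first(iterable, fn):
    """Return fn(x) for the first x where fn(x) is not None,
    mirroring Scala's collectFirst over an iterator."""
    for x in iterable:
        result = fn(x)
        if result is not None:
            return result
    return None

reads = []

def read_schema(path):
    # Stand-in for reading a file footer; records which files were opened.
    reads.append(path)
    return {"path": path, "schema": "struct<sample:double>"}

# 1000 "partition files" are listed eagerly (the expensive part on S3),
# but only one of them is actually read for schema inference.
files = [f"part-{i:05d}.orc" for i in range(1000)]
schema = collect_first(iter(files), read_schema)

print(schema["path"])  # part-00000.orc
print(len(reads))      # 1
```

So the cost Jarred measured is the listing itself, not redundant reads — the iterator stops after the first successful schema read.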
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175117#comment-17175117 ] Tianchen Zhang commented on SPARK-25299: Hi [~mcheah], by following the history of this remote storage project, we know that the community's initial attempt was to resolve the limitations of the shuffle service. Does it imply that it's recommended to have custom remote storage implementation to avoid any limitations that a shuffle service may cause? But for Spark on K8S we also notice the experimental option of enabling shuffle service as a DaemonSet. So we want to make sure what we are doing is in the right direction with the community. Thanks. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > Labels: SPIP > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. 
> * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. > In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. > Edit June 28 2019: Our SPIP is here: > [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32583) PySpark Structured Streaming Testing Support
Felix Kizhakkel Jose created SPARK-32583: Summary: PySpark Structured Streaming Testing Support Key: SPARK-32583 URL: https://issues.apache.org/jira/browse/SPARK-32583 Project: Spark Issue Type: Question Components: PySpark Affects Versions: 2.4.4 Reporter: Felix Kizhakkel Jose Hello, I have investigated a lot but couldn't get any help or resource on how to +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ (ingesting from Kafka topics to S3) and how to build Continuous Integration (CI)/Continuous Deployment (CD). 1. Is it possible to test (unit test, integration test) pyspark structured stream? 2. How to build Continuous Integration (CI)/Continuous Deployment (CD)? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
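On question 1, a common pattern (not an official PySpark API; sketched here with plain Python dicts rather than real DataFrames, so it runs without a cluster) is to factor the pipeline's transformation logic into pure functions that can be unit-tested against static in-memory batches, keeping the Kafka source and S3 sink at the edges of the job:

```python
# A minimal sketch of the usual testing pattern for streaming jobs:
# keep the per-record transformation pure and separate from the
# streaming plumbing, so it can be unit-tested without Kafka or S3.
# Records are plain dicts here; in a real job the same function shape
# would operate on a DataFrame.

def enrich(record):
    """The logic under test: parse and tag one Kafka-style record."""
    value = record["value"].decode("utf-8")
    return {"key": record["key"], "length": len(value), "upper": value.upper()}

def transform(records):
    """Applied identically to a static batch (in tests) or a stream."""
    return [enrich(r) for r in records]

# Unit test on a static, in-memory "batch" -- no cluster needed.
sample = [{"key": "k1", "value": b"hello"}]
out = transform(sample)
assert out == [{"key": "k1", "length": 5, "upper": "HELLO"}]
print("ok")
```

The same transform can then be wired to the real readStream/writeStream in the job entry point, and a CI pipeline only needs to run the pure-function tests.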
[jira] [Updated] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32528: -- Fix Version/s: 3.0.1 > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > In tests, we usually call `plan.analyze` to get the analyzed plan and test > analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan > is resolved, which may surprise the test writers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32571) yarnClient.killApplication(appId) is never called
[ https://issues.apache.org/jira/browse/SPARK-32571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175096#comment-17175096 ] A Tester commented on SPARK-32571: -- [~viirya], if the application is a spark-streaming application, I can understand that... but for batch-job style applications, it's typically desirable to have the life-cycle of the spark application on the cluster be synchronous with the life-cycle of the batch job itself (i.e. a job being scheduled by Apache Airflow, Autosys, etc.). The existing (although unreachable) code leads me to believe the same was the original intent. While Client mode would work, in my use case it's a major benefit to run in cluster mode, since the resources for the Driver will come from the cluster instead of the box the job is launched from. However, an implementation triggered via some spark option would also be acceptable. What do you think? I have only ever used spark with yarn, so I'm not sure whether the same signal propagation would be useful for other options (kubernetes, standalone, etc.) Something like ||Property Name||Default||Meaning||Since Version|| |spark.yarn.propagateKillSignal|{{False}}|If enabled and the job is running in cluster mode, and the spark-submit process is killed gracefully, then the job running in yarn will also be killed.|...| > yarnClient.killApplication(appId) is never called > - > > Key: SPARK-32571 > URL: https://issues.apache.org/jira/browse/SPARK-32571 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.4.0, 3.0.0 >Reporter: A Tester >Priority: Major > > *Problem Statement:* > When an application is submitted using spark-submit in cluster mode using > yarn, the spark application continues to run on the cluster, even if > spark-submit itself has been requested to shut down (Ctrl-C/SIGTERM/etc.) 
> While there is code inside org.apache.spark.deploy.yarn.Client.scala that > would lead you to believe the spark application on the cluster will shut > down, this code is not currently reachable. > Example of behavior: > spark-submit ... > or kill -15 > spark-submit itself dies > job can still be found running on the cluster > > *Expectation:* > When spark-submit is monitoring a yarn app and spark-submit itself is > requested to shut down (SIGTERM, HUP, etc.), it should call > yarnClient.killApplication(appId) so that the actual spark application > running on the cluster is killed. > > > *Proposal* > There is already a shutdown hook registered which cleans up temp files. > Could this be extended to call yarnClient.killApplication? > I believe the default behavior should be to request yarn to kill the > application; however, I can imagine use cases where you may still want it to > run. To facilitate these use cases, an option should be provided to skip > this hook. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
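The proposed behavior — a shutdown hook that propagates the kill to the cluster — can be sketched in miniature. This Python sketch is illustrative only: {{kill_application}} and the application id are hypothetical stand-ins for the JVM-side {{yarnClient.killApplication(appId)}}; it shows a SIGTERM handler invoking the kill when the submitting process is asked to shut down.

```python
import os
import signal

killed = []

def kill_application(app_id):
    # Hypothetical stand-in for yarnClient.killApplication(appId);
    # here we only record that the kill was requested.
    killed.append(app_id)

def make_handler(app_id):
    def handler(signum, frame):
        # Propagate the kill to the cluster-side application first;
        # a real hook would then continue with the normal shutdown path
        # (temp-file cleanup, process exit).
        kill_application(app_id)
    return handler

# Install the hook for the application being monitored.
signal.signal(signal.SIGTERM, make_handler("application_1234_0001"))

# Simulate `kill -15 <spark-submit pid>` against this process.
os.kill(os.getpid(), signal.SIGTERM)
print(killed)  # ['application_1234_0001']
```

The opt-out discussed above would simply skip installing the handler when the user wants the cluster application to outlive spark-submit.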
[jira] [Updated] (SPARK-32577) Fix the config value for shuffled hash join in test in-joins.sql
[ https://issues.apache.org/jira/browse/SPARK-32577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-32577: - Parent: SPARK-32461 Issue Type: Sub-task (was: Improvement) > Fix the config value for shuffled hash join in test in-joins.sql > > > Key: SPARK-32577 > URL: https://issues.apache.org/jira/browse/SPARK-32577 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > The config value for enabling shuffled hash join is wrong in test > [https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-joins.sql#L8-L10], > which should not be spark.sql.autoBroadcastJoinThreshold=-1. Created this > Jira to fix this. This is a very minor change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175052#comment-17175052 ] Sean R. Owen commented on SPARK-32530: -- I speak for myself, but, I think the cost is viewed as very high. Which is valid; maintaining every change across 3 languages has proved to be a lot of work. As I say, I'm not even sure of the future of the R bindings; despite hopes that these would be equally maintained, they just aren't at parity, even. I think the argument would have to be: this opens up Spark to a class of users that can't really otherwise use it. JVM users use Spark directly (or Java, more rarely), Python users can use Pyspark; R users for now can use SparkR or sparklyr, which is a separate project. I am not sure how many users use Kotlin, but can't use Java or Scala? I fully trust the intentions of any group that says "we will maintain our contribution" despite the inevitable fact that it doesn't always work out that way in OSS. If the Kotlin bindings are in Spark, notionally, they have to be updated by anyone touching any APIs, etc. That just will never only fall on one group of maintainers. OK, so you say, it's fine if they lag a bit, differ a bit. If some degree of 'lag' is OK, then I think that neuters the argument, in the short term, for pushing it into Spark now. Just let it succeed as-is, as a third-party project. As you say, .NET is 'harder' to run this way but its bindings have done OK separately. What goes wrong if it stays in its current form? > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Pasha Finkeshteyn >Priority: Major > > h2. Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. > * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." > * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share > increased by 7.8% in 2020. > We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, authors of this SPIP, strongly believe that making Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The goal of this project is to bring first-class support for Kotlin language > into the Apache Spark project. We’re going to achieve this by adding one more > module to the current Apache Spark distribution. > h2. Non-goals > There is no goal to replace any existing language support or to change any > existing Apache Spark API. > At this time, there is no goal to support non-core APIs of Apache Spark like > Spark ML and Spark structured streaming. This may change in the future based > on community feedback. > There is no goal to provide CLI for Kotlin for Apache Spark, this will be a > separate SPIP. > There is no goal to provide support for Apache Spark < 3.0.0. > h2. Current implementation > A working prototype is available at > [https://github.com/JetBrains/kotlin-spark-api]. 
It has been tested inside > JetBrains and by early adopters. > h2. What are the risks? > There is always a risk that this product won’t get enough popularity and will > bring more costs than benefits. It can be mitigated by the fact that we don't > need to change any existing API and support can be potentially dropped at any > time. > We also believe that existing API is rather low maintenance. It does not > bring anything more complex than already exists in the Spark codebase. > Furthermore, the implementation is compact - less than 2000 lines of code. > We are committed to maintaining, improving and evolving the API based on > feedback from both Spark and Kotlin communities. As the Kotlin data community > continues to grow, we see Kotlin API for Apache Spark as an important part in > the evolving Kotlin ecosystem, and intend to fully support it. > h2. How
[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175010#comment-17175010 ] Pasha Finkeshteyn commented on SPARK-32530: --- Hi [~srowen], we agree with you that it’s good to have more first-class language bindings for Spark. We also agree with you that if Kotlin is part of Apache Spark, it becomes more 'official' and trusted. Of course, as with anything, there’s a maintenance cost, but it isn’t fair to compare it to .NET bindings, Python or R. After all, Kotlin is a JVM language and compatibility comes way easier. Still, there’s maintenance. As we say in the ticket, we are committed to maintaining, improving and evolving the API based on feedback from both Spark and Kotlin communities, and we intend to fully support it. I can understand that this may not be enough, but we hope that with time and growing community, the value of integrating Kotlin bindings into Apache Spark becomes more apparent. It would be great if you could shed some light on how does the Spark project's management committee measure the value vs cost, and what would be the tipping point. > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Pasha Finkeshteyn >Priority: Major > > h2. Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. > In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. > * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." 
> * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share > increased by 7.8% in 2020. > We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, authors of this SPIP, strongly believe that making Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The goal of this project is to bring first-class support for Kotlin language > into the Apache Spark project. We’re going to achieve this by adding one more > module to the current Apache Spark distribution. > h2. Non-goals > There is no goal to replace any existing language support or to change any > existing Apache Spark API. > At this time, there is no goal to support non-core APIs of Apache Spark like > Spark ML and Spark structured streaming. This may change in the future based > on community feedback. > There is no goal to provide CLI for Kotlin for Apache Spark, this will be a > separate SPIP. > There is no goal to provide support for Apache Spark < 3.0.0. > h2. Current implementation > A working prototype is available at > [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside > JetBrains and by early adopters. > h2. What are the risks? > There is always a risk that this product won’t get enough popularity and will > bring more costs than benefits. It can be mitigated by the fact that we don't > need to change any existing API and support can be potentially dropped at any > time. > We also believe that existing API is rather low maintenance. 
It does not > bring anything more complex than already exists in the Spark codebase. > Furthermore, the implementation is compact - less than 2000 lines of code. > We are committed to maintaining, improving and evolving the API based on > feedback from both Spark and Kotlin communities. As the Kotlin data community > continues to grow, we see Kotlin API for Apache Spark as an important part in > the evolving Kotlin ecosystem, and intend to fully support it. > h2. How long will it take? > A working implementation is already available, and if the community will > have any proposal of changes for this implementation to be improved, these > can be implemented quickly — in weeks if not days. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail:
[jira] [Commented] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174989#comment-17174989 ] Apache Spark commented on SPARK-32528: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29400 > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 3.1.0 > > > In tests, we usually call `plan.analyze` to get the analyzed plan and test > analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan > is resolved, which may surprise the test writers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174988#comment-17174988 ] Apache Spark commented on SPARK-32528: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29400 > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 3.1.0 > > > In tests, we usually call `plan.analyze` to get the analyzed plan and test > analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan > is resolved, which may surprise the test writers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32469. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29273 [https://github.com/apache/spark/pull/29273] > ApplyColumnarRulesAndInsertTransitions should be idempotent > --- > > Key: SPARK-32469 > URL: https://issues.apache.org/jira/browse/SPARK-32469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > > It's good hygiene to keep rules idempotent, even if we know the rule is going > to be run only once. This is for future-proof. > ApplyColumnarRulesAndInsertTransitions can add columnar-to-row and > row-to-columnar operators repeatedly, we should fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
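The idempotence hygiene described above can be checked generically: apply a rule twice and require the second application to be a no-op. A toy Python sketch (the "plan" and "rules" are stand-ins, not Spark's ApplyColumnarRulesAndInsertTransitions) shows both a guarded rule that passes the check and an unguarded one that keeps adding transitions on every run:

```python
def insert_transitions(plan):
    """Guarded rule: wrap each bare 'columnar' node in a 'to_row'
    transition, skipping nodes that are already wrapped."""
    return [("to_row", node) if node == "columnar" else node for node in plan]

def wrap_always(plan):
    """Buggy variant: wraps unconditionally, so each run adds a layer."""
    return [("to_row", node) for node in plan]

def is_idempotent(rule, plan):
    """Generic check: running the rule a second time must change nothing."""
    once = rule(plan)
    return rule(once) == once

plan = ["columnar", "row"]
print(insert_transitions(plan))          # [('to_row', 'columnar'), 'row']
print(is_idempotent(insert_transitions, plan))  # True
print(is_idempotent(wrap_always, plan))         # False
```

This is the kind of future-proofing the ticket refers to: the guard costs nothing when the rule runs once, but makes repeated application safe.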
[jira] [Resolved] (SPARK-32403) SCRIP TRANSFORM Extract common method from process row to avoid repeated judgement
[ https://issues.apache.org/jira/browse/SPARK-32403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32403. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29199 [https://github.com/apache/spark/pull/29199] > SCRIP TRANSFORM Extract common method from process row to avoid repeated > judgement > -- > > Key: SPARK-32403 > URL: https://issues.apache.org/jira/browse/SPARK-32403 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32403) SCRIP TRANSFORM Extract common method from process row to avoid repeated judgement
[ https://issues.apache.org/jira/browse/SPARK-32403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32403: --- Assignee: angerszhu > SCRIP TRANSFORM Extract common method from process row to avoid repeated > judgement > -- > > Key: SPARK-32403 > URL: https://issues.apache.org/jira/browse/SPARK-32403 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32409) Document the dependency between spark.metrics.staticSources.enabled and JVMSource registration
[ https://issues.apache.org/jira/browse/SPARK-32409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32409: -- Issue Type: Documentation (was: Bug) > Document the dependency between spark.metrics.staticSources.enabled and > JVMSource registration > -- > > Key: SPARK-32409 > URL: https://issues.apache.org/jira/browse/SPARK-32409 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > SPARK-29654 has introduced the configuration > `spark.metrics.register.static.sources.enabled`. > The current implementation of SPARK-29654 and SPARK-25277 have, as side > effect, that when {{spark.metrics.register.static.sources.enabled}} is set to > false, the registration of JVM Source is also ignored, even if requested with > {{spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource}} > A PR is proposed to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32409) Document the dependency between spark.metrics.staticSources.enabled and JVMSource registration
[ https://issues.apache.org/jira/browse/SPARK-32409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32409: - Assignee: Luca Canali > Document the dependency between spark.metrics.staticSources.enabled and > JVMSource registration > -- > > Key: SPARK-32409 > URL: https://issues.apache.org/jira/browse/SPARK-32409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > SPARK-29654 has introduced the configuration > `spark.metrics.register.static.sources.enabled`. > The current implementation of SPARK-29654 and SPARK-25277 have, as side > effect, that when {{spark.metrics.register.static.sources.enabled}} is set to > false, the registration of JVM Source is also ignored, even if requested with > {{spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource}} > A PR is proposed to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32409) Document the dependency between spark.metrics.staticSources.enabled and JVMSource registration
[ https://issues.apache.org/jira/browse/SPARK-32409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32409. --- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 29203 [https://github.com/apache/spark/pull/29203] > Document the dependency between spark.metrics.staticSources.enabled and > JVMSource registration > -- > > Key: SPARK-32409 > URL: https://issues.apache.org/jira/browse/SPARK-32409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > SPARK-29654 has introduced the configuration > `spark.metrics.register.static.sources.enabled`. > The current implementation of SPARK-29654 and SPARK-25277 have, as side > effect, that when {{spark.metrics.register.static.sources.enabled}} is set to > false, the registration of JVM Source is also ignored, even if requested with > {{spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource}} > A PR is proposed to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32235) Kubernetes Configuration to set Service Account to Executors
[ https://issues.apache.org/jira/browse/SPARK-32235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pedro Rossi resolved SPARK-32235. - Fix Version/s: 3.1.0 Resolution: Fixed This issue was solved by SPARK-30122, but that fix has not been released yet > Kubernetes Configuration to set Service Account to Executors > > > Key: SPARK-32235 > URL: https://issues.apache.org/jira/browse/SPARK-32235 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Pedro Rossi >Priority: Minor > Fix For: 3.1.0 > > > Some cloud providers use Service Accounts to provide resource authorization > (one example is described here > [https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/)] > and for this we need to be able to set Service Accounts on the executors. > My idea for development of this feature would be to have a configuration like > "spark.kubernetes.authenticate.executor.serviceAccountName" in order to set > the executors' Service Account; this way it would be possible to grant only > certain accesses to the driver and others to the executors, or the same access > (user's choice). > I am creating this issue so the maintainers can share opinions first, but I > intend to create a pull request to address this issue as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32580) Issue accessing a column values after 'explode' function
[ https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174361#comment-17174361 ] Takeshi Yamamuro commented on SPARK-32580: -- I made the given query simpler; {code} scala> import org.apache.spark.sql.functions scala> val df1 = spark.range(1).selectExpr("array(named_struct('item_id', 1, 'values', array(named_struct('sample', 0.1)))) as data") scala> val df2 = df1.withColumn("item", functions.explode(df1.col("data"))) scala> val df3 = df2.withColumn("value", functions.explode(df2.col("item.values"))) scala> val df4 = df3.select(df3.col("item.item_id"), df3.col("value.sample")) scala> df4.show() 20/08/10 22:53:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _gen_alias_26#26 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark. {code} Looks like this issue only happens in branch-3.0. It passed in master/branch-2.4. 
> Issue accessing a column values after 'explode' function > > > Key: SPARK-32580 > URL: https://issues.apache.org/jira/browse/SPARK-32580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ayrat Sadreev >Priority: Major > Attachments: ExplodeTest.java, data.json > > > An exception occurs when trying to flatten double nested arrays > The schema is > {code:none} > root > |-- data: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- item_id: string (nullable = true) > |||-- timestamp: string (nullable = true) > |||-- values: array (nullable = true) > ||||-- element: struct (containsNull = true) > |||||-- sample: double (nullable = true) > {code} > The target schema is > {code:none} > root > |-- item_id: string (nullable = true) > |-- timestamp: string (nullable = true) > |-- sample: double (nullable = true) > {code} > The code (in Java) > {code:java} > package com.skf.streamer.spark; > import java.util.concurrent.TimeoutException; > import org.apache.spark.SparkConf; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.functions; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > public class ExplodeTest { >public static void main(String[] args) throws TimeoutException { > SparkConf conf = new SparkConf() > .setAppName("SimpleApp") > .set("spark.scheduler.mode", "FAIR") > .set("spark.master", "local[1]") > .set("spark.sql.streaming.checkpointLocation", "checkpoint"); > SparkSession spark = SparkSession.builder() > .config(conf) > .getOrCreate(); > Dataset d0 = spark > .read() > .format("json") > .option("multiLine", "true") > .schema(getSchema()) > .load("src/test/resources/explode/data.json"); > d0.printSchema(); > d0 = d0.withColumn("item", functions.explode(d0.col("data"))); > d0 = d0.withColumn("value", 
functions.explode(d0.col("item.values"))); > d0.printSchema(); > d0 = d0.select( > d0.col("item.item_id"), > d0.col("item.timestamp"), > d0.col("value.sample") > ); > d0.printSchema(); > d0.show(); // Failes > spark.stop(); >} >private static StructType getSchema() { > StructField[] level2Fields = { > DataTypes.createStructField("sample", DataTypes.DoubleType, false), > }; > StructField[] level1Fields = { > DataTypes.createStructField("item_id", DataTypes.StringType, false), > DataTypes.createStructField("timestamp", DataTypes.StringType, > false), > DataTypes.createStructField("values", > DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false) > }; > StructField[] fields = { > DataTypes.createStructField("data", > DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false) > }; > return DataTypes.createStructType(fields); >} > } > {code} > The data file >
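For reference, the flatten the reporter is after can also be written with each explode aliased in its own projection. A sketch in Scala (column names follow the issue's schema; `df` stands for the loaded dataset, and whether this query shape avoids the binding error on 3.0.0 is untested):

```scala
// Two-level flatten: data: array<struct<item_id, timestamp, values: array<struct<sample>>>>
// down to one row per (item_id, timestamp, sample).
import org.apache.spark.sql.functions._

val flattened = df
  .select(explode(col("data")).as("item"))              // first level: one row per item
  .select(col("item.item_id"), col("item.timestamp"),
          explode(col("item.values")).as("value"))      // second level: one row per sample
  .select(col("item_id"), col("timestamp"), col("value.sample"))
```

Spark allows one generator (such as explode) per SELECT, so each level of nesting gets its own projection here rather than two chained withColumn calls.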
[jira] [Comment Edited] (SPARK-32580) Issue accessing a column values after 'explode' function
[ https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174361#comment-17174361 ] Takeshi Yamamuro edited comment on SPARK-32580 at 8/10/20, 2:38 PM: I made the given query simpler; {code} scala> import org.apache.spark.sql.functions scala> val df1 = spark.range(1).selectExpr("array(named_struct('item_id', 1, 'values', array(named_struct('sample', 0.1)))) as data") scala> val df2 = df1.withColumn("item", functions.explode(df1.col("data"))) scala> val df3 = df2.withColumn("value", functions.explode(df2.col("item.values"))) scala> val df4 = df3.select(df3.col("item.item_id"), df3.col("value.sample")) scala> df4.show() 20/08/10 22:53:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _gen_alias_26#26 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark. {code} Looks like this issue only happens in v3.0.0. It passed in master/v2.4.6. 
was (Author: maropu): I made the given query simpler; (same repro and stack trace as above) Looks like this issue only happens in branch-3.0. It passed in master/branch-2.4. 
> Issue accessing a column values after 'explode' function > > > Key: SPARK-32580 > URL: https://issues.apache.org/jira/browse/SPARK-32580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ayrat Sadreev >Priority: Major > Attachments: ExplodeTest.java, data.json
[jira] [Assigned] (SPARK-32517) Add StorageLevel.DISK_ONLY_3
[ https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32517: - Assignee: Dongjoon Hyun > Add StorageLevel.DISK_ONLY_3 > > > Key: SPARK-32517 > URL: https://issues.apache.org/jira/browse/SPARK-32517 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32517) Add StorageLevel.DISK_ONLY_3
[ https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32517. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29331 [https://github.com/apache/spark/pull/29331] > Add StorageLevel.DISK_ONLY_3 > > > Key: SPARK-32517 > URL: https://issues.apache.org/jira/browse/SPARK-32517 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
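The new level is used like the existing built-in StorageLevel constants. A minimal sketch (the `rdd` value is a placeholder for any RDD in a Spark 3.1.0+ application):

```scala
import org.apache.spark.storage.StorageLevel

// DISK_ONLY_3: serialized blocks on disk only, replicated on 3 nodes --
// analogous to the pre-existing DISK_ONLY and DISK_ONLY_2 levels.
val cached = rdd.persist(StorageLevel.DISK_ONLY_3)
```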
[jira] [Created] (SPARK-32582) Spark SQL Infer Schema Performance
Jarred Li created SPARK-32582: - Summary: Spark SQL Infer Schema Performance Key: SPARK-32582 URL: https://issues.apache.org/jira/browse/SPARK-32582 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0, 2.4.6 Reporter: Jarred Li When schema inference is enabled, Spark lists all the files in the table, even though only one of the files is actually read for schema information. Performance suffers from this full listing when the number of partitions is large. See the code in "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]": all the files in the table are passed in, but only one file's schema is used to infer the schema. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
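Until the listing is avoided, the usual way to sidestep the inference cost is to supply the schema explicitly, which skips the schema-inference step entirely (file listing for partition discovery still happens at planning time). A sketch in Scala with an illustrative schema and path, not taken from the issue:

```scala
import org.apache.spark.sql.types._

// Declaring the schema up front means Spark never has to open a data file
// (or list them all) just to discover column names and types.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))

val df = spark.read.schema(schema).orc("/warehouse/events")  // hypothetical path
```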
[jira] [Updated] (SPARK-32554) Remove the words "experimental" in the k8s document
[ https://issues.apache.org/jira/browse/SPARK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32554: -- Description: To make users understood more correctly about the current development status of the k8s scheduler, this ticket targets at updating the k8s document in the primary branch/branch-3.0; BEFORE: {code:java} The Kubernetes scheduler is currently experimental. In future versions, there may be behavioral changes around configuration, container images and entrypoints.{code} AFTER: {code:java} {code} This comes from a thread in the spark-dev mailing list: [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html] was: To make users understood more correctly about the current development status of the k8s scheduler, this ticket targets at updating the k8s document in the primary branch/branch-3.0; BEFORE: {code:java} The Kubernetes scheduler is currently experimental. In future versions, there may be behavioral changes around configuration, container images and entrypoints.{code} AFTER: {code:java} The Kubernetes scheduler is currently experimental. 
The most basic parts are getting stable, but Dynamic Resource Allocation and External Shuffle Service need to be available before we officially announce GA for it.{code} This comes from a thread in the spark-dev mailing list: [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html] > Remove the words "experimental" in the k8s document > --- > > Key: SPARK-32554 > URL: https://issues.apache.org/jira/browse/SPARK-32554 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.0.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.1.0 > > > To make users understood more correctly about the current development status > of the k8s scheduler, this ticket targets at updating the k8s document in the > primary branch/branch-3.0; > BEFORE: > {code:java} > The Kubernetes scheduler is currently experimental. In future versions, > there may be behavioral changes around > configuration, container images and entrypoints.{code} > AFTER: > {code:java} > {code} > This comes from a thread in the spark-dev mailing list: > [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32554) Remove the words "experimental" in the k8s document
[ https://issues.apache.org/jira/browse/SPARK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32554: -- Summary: Remove the words "experimental" in the k8s document (was: Update the k8s document according to the current development status) > Remove the words "experimental" in the k8s document > --- > > Key: SPARK-32554 > URL: https://issues.apache.org/jira/browse/SPARK-32554 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.0.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.1.0 > > > To make users understood more correctly about the current development status > of the k8s scheduler, this ticket targets at updating the k8s document in the > primary branch/branch-3.0; > BEFORE: > {code:java} > The Kubernetes scheduler is currently experimental. In future versions, > there may be behavioral changes around > configuration, container images and entrypoints.{code} > AFTER: > {code:java} > The Kubernetes scheduler is currently experimental. The most basic parts are > getting stable, but Dynamic > Resource Allocation and External Shuffle Service need to be available before > we officially announce GA for it.{code} > This comes from a thread in the spark-dev mailing list: > [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32554) Update the k8s document according to the current development status
[ https://issues.apache.org/jira/browse/SPARK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32554: - Assignee: Takeshi Yamamuro > Update the k8s document according to the current development status > --- > > Key: SPARK-32554 > URL: https://issues.apache.org/jira/browse/SPARK-32554 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.0.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > > To make users understood more correctly about the current development status > of the k8s scheduler, this ticket targets at updating the k8s document in the > primary branch/branch-3.0; > BEFORE: > {code:java} > The Kubernetes scheduler is currently experimental. In future versions, > there may be behavioral changes around > configuration, container images and entrypoints.{code} > AFTER: > {code:java} > The Kubernetes scheduler is currently experimental. The most basic parts are > getting stable, but Dynamic > Resource Allocation and External Shuffle Service need to be available before > we officially announce GA for it.{code} > This comes from a thread in the spark-dev mailing list: > [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32554) Update the k8s document according to the current development status
[ https://issues.apache.org/jira/browse/SPARK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32554. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29368 [https://github.com/apache/spark/pull/29368] > Update the k8s document according to the current development status > --- > > Key: SPARK-32554 > URL: https://issues.apache.org/jira/browse/SPARK-32554 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.0.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.1.0 > > > To make users understood more correctly about the current development status > of the k8s scheduler, this ticket targets at updating the k8s document in the > primary branch/branch-3.0; > BEFORE: > {code:java} > The Kubernetes scheduler is currently experimental. In future versions, > there may be behavioral changes around > configuration, container images and entrypoints.{code} > AFTER: > {code:java} > The Kubernetes scheduler is currently experimental. The most basic parts are > getting stable, but Dynamic > Resource Allocation and External Shuffle Service need to be available before > we officially announce GA for it.{code} > This comes from a thread in the spark-dev mailing list: > [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32580) Issue accessing a column values after 'explode' function
[ https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayrat Sadreev updated SPARK-32580: -- Attachment: ExplodeTest.java > Issue accessing a column values after 'explode' function > > > Key: SPARK-32580 > URL: https://issues.apache.org/jira/browse/SPARK-32580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ayrat Sadreev >Priority: Major > Attachments: ExplodeTest.java, data.json > > > An exception occurs when trying to flatten double nested arrays > The schema is > {code:none} > root > |-- data: array (nullable = true) > ||-- element: struct (containsNull = true) > |||-- item_id: string (nullable = true) > |||-- timestamp: string (nullable = true) > |||-- values: array (nullable = true) > ||||-- element: struct (containsNull = true) > |||||-- sample: double (nullable = true) > {code} > The target schema is > {code:none} > root > |-- item_id: string (nullable = true) > |-- timestamp: string (nullable = true) > |-- sample: double (nullable = true) > {code} > The code (in Java) > {code:java} > package com.skf.streamer.spark; > import java.util.concurrent.TimeoutException; > import org.apache.spark.SparkConf; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.functions; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > public class ExplodeTest { >public static void main(String[] args) throws TimeoutException { > SparkConf conf = new SparkConf() > .setAppName("SimpleApp") > .set("spark.scheduler.mode", "FAIR") > .set("spark.master", "local[1]") > .set("spark.sql.streaming.checkpointLocation", "checkpoint"); > SparkSession spark = SparkSession.builder() > .config(conf) > .getOrCreate(); > Dataset d0 = spark > .read() > .format("json") > .option("multiLine", "true") > 
.schema(getSchema()) > .load("src/test/resources/explode/data.json"); > d0.printSchema(); > d0 = d0.withColumn("item", functions.explode(d0.col("data"))); > d0 = d0.withColumn("value", functions.explode(d0.col("item.values"))); > d0.printSchema(); > d0 = d0.select( > d0.col("item.item_id"), > d0.col("item.timestamp"), > d0.col("value.sample") > ); > d0.printSchema(); > d0.show(); // Failes > spark.stop(); >} >private static StructType getSchema() { > StructField[] level2Fields = { > DataTypes.createStructField("sample", DataTypes.DoubleType, false), > }; > StructField[] level1Fields = { > DataTypes.createStructField("item_id", DataTypes.StringType, false), > DataTypes.createStructField("timestamp", DataTypes.StringType, > false), > DataTypes.createStructField("values", > DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false) > }; > StructField[] fields = { > DataTypes.createStructField("data", > DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false) > }; > return DataTypes.createStructType(fields); >} > } > {code} > The data file > {code:json} > { > "data": [ > { > "item_id": "item_1", > "timestamp": "2020-07-01 12:34:89", > "values": [ > { > "sample": 1.1 > }, > { > "sample": 1.2 > } > ] > }, > { > "item_id": "item_2", > "timestamp": "2020-07-02 12:34:89", > "values": [ > { > "sample": 2.2 > } > ] > } > ] > } > {code} > Dataset.show() method fails with an exception > {code:none} > Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in > [_gen_alias_28#28,_gen_alias_29#29] > at scala.sys.package$.error(package.scala:30) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 
37 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32580) Issue accessing a column values after 'explode' function
[ https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayrat Sadreev updated SPARK-32580: -- Attachment: data.json > Issue accessing a column values after 'explode' function > > > Key: SPARK-32580 > URL: https://issues.apache.org/jira/browse/SPARK-32580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ayrat Sadreev >Priority: Major > Attachments: ExplodeTest.java, data.json -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32580) Issue accessing a column values after 'explode' function
[ https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174303#comment-17174303 ] Takeshi Yamamuro edited comment on SPARK-32580 at 8/10/20, 1:26 PM: Please do not use "Blocker" in the priority; it is reserved for committers. Anyway, thanks for the report. was (Author: maropu): Please do not use "Blocker" in the priority and it is reserved by committers. Anyway, thanks for the report. > Issue accessing a column values after 'explode' function > > > Key: SPARK-32580 > URL: https://issues.apache.org/jira/browse/SPARK-32580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ayrat Sadreev >Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32580) Issue accessing a column values after 'explode' function
[ https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174303#comment-17174303 ] Takeshi Yamamuro commented on SPARK-32580: -- Please do not use "Blocker" as the priority; it is reserved for committers. Anyway, thanks for the report. > Issue accessing a column values after 'explode' function > > > Key: SPARK-32580 > URL: https://issues.apache.org/jira/browse/SPARK-32580 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: Ayrat Sadreev > Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32580) Issue accessing a column values after 'explode' function
[ https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32580: - Component/s: (was: Spark Core) > Issue accessing a column values after 'explode' function > > > Key: SPARK-32580 > URL: https://issues.apache.org/jira/browse/SPARK-32580 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: Ayrat Sadreev > Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32580) Issue accessing a column values after 'explode' function
[ https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32580: - Priority: Major (was: Blocker) > Issue accessing a column values after 'explode' function > > > Key: SPARK-32580 > URL: https://issues.apache.org/jira/browse/SPARK-32580 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 3.0.0 > Reporter: Ayrat Sadreev > Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32581) update duration property for live ui application list and application apis
[ https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32581: Assignee: Apache Spark > update duration property for live ui application list and application apis > -- > > Key: SPARK-32581 > URL: https://issues.apache.org/jira/browse/SPARK-32581 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Zhen Li >Assignee: Apache Spark >Priority: Trivial > Attachments: oldapi.JPG, updatedapiJPG.JPG > > > "duration" property in response from application list and application APIs of > live UI is always "0". we want to let these two APIs return correct value, > same with "*Total Uptime*" in live UI's job page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32581) update duration property for live ui application list and application apis
[ https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174283#comment-17174283 ] Apache Spark commented on SPARK-32581: -- User 'zhli1142015' has created a pull request for this issue: https://github.com/apache/spark/pull/29399 > update duration property for live ui application list and application apis > -- > > Key: SPARK-32581 > URL: https://issues.apache.org/jira/browse/SPARK-32581 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.0 > Reporter: Zhen Li > Priority: Trivial > Attachments: oldapi.JPG, updatedapiJPG.JPG -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32581) update duration property for live ui application list and application apis
[ https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32581: Assignee: (was: Apache Spark) > update duration property for live ui application list and application apis > -- > > Key: SPARK-32581 > URL: https://issues.apache.org/jira/browse/SPARK-32581 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.0 > Reporter: Zhen Li > Priority: Trivial > Attachments: oldapi.JPG, updatedapiJPG.JPG -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32581) update duration property for live ui application list and application apis
[ https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174286#comment-17174286 ] Apache Spark commented on SPARK-32581: -- User 'zhli1142015' has created a pull request for this issue: https://github.com/apache/spark/pull/29399 > update duration property for live ui application list and application apis > -- > > Key: SPARK-32581 > URL: https://issues.apache.org/jira/browse/SPARK-32581 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.0 > Reporter: Zhen Li > Priority: Trivial > Attachments: oldapi.JPG, updatedapiJPG.JPG -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32581) update duration property for live ui application list and application apis
[ https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhen Li updated SPARK-32581: Attachment: updatedapiJPG.JPG oldapi.JPG > update duration property for live ui application list and application apis > -- > > Key: SPARK-32581 > URL: https://issues.apache.org/jira/browse/SPARK-32581 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.0 > Reporter: Zhen Li > Priority: Trivial > Attachments: oldapi.JPG, updatedapiJPG.JPG -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32581) update duration property for live ui application list and application apis
Zhen Li created SPARK-32581: --- Summary: update duration property for live ui application list and application apis Key: SPARK-32581 URL: https://issues.apache.org/jira/browse/SPARK-32581 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: Zhen Li The "duration" property in the response from the application list and application APIs of the live UI is always "0". We want these two APIs to return the correct value, matching the "*Total Uptime*" shown on the live UI's Jobs page. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
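The change requested above amounts to reporting a live application's duration as elapsed time since start, rather than a placeholder 0. A minimal sketch of that logic (plain Java; the `durationMs` helper and the -1 "no end time yet" sentinel are illustrative assumptions, not Spark's actual AppStatusStore API):

```java
public class LiveDuration {

    // For a finished application, duration = end - start. For a live application
    // the end time is not set yet (modeled here as -1, which is what produces the
    // "always 0" placeholder problem), so report elapsed wall-clock time instead,
    // matching the "Total Uptime" shown on the live UI's Jobs page.
    public static long durationMs(long startTimeMs, long endTimeMs, long nowMs) {
        return endTimeMs > 0 ? endTimeMs - startTimeMs : nowMs - startTimeMs;
    }

    public static void main(String[] args) {
        System.out.println(durationMs(1_000, 5_000, 9_000)); // finished app: 4000
        System.out.println(durationMs(1_000, -1, 9_000));    // live app: 8000
    }
}
```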
[jira] [Comment Edited] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174263#comment-17174263 ] Takeshi Yamamuro edited comment on SPARK-29767 at 8/10/20, 11:48 AM: - I checked the given query and I think your union query looks incorrect. All the branches (master/branch-3.0/branch-2.4) have the same error; {code:java} : org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 5 columns and the second table has 20 columns;; {code} Actually, the schemas are different; {code:java} >>> base_df.printSchema() root |-- hour: long (nullable = true) |-- title: string (nullable = true) |-- __deleted: string (nullable = true) |-- status: string (nullable = true) |-- transformationid: string (nullable = true) |-- roomid: string (nullable = true) |-- day: long (nullable = true) |-- notes: string (nullable = true) |-- nunitsfromaudit: long (nullable = true) |-- ts_ms: long (nullable = true) |-- liability: string (nullable = true) |-- _class: string (nullable = true) |-- month: long (nullable = true) |-- updatedate: struct (nullable = true) | |-- date: long (nullable = true) |-- _id: struct (nullable = true) | |-- oid: string (nullable = true) |-- year: long (nullable = true) |-- item: struct (nullable = true) | |-- name: string (nullable = true) | |-- brandname: string (nullable = true) | |-- perunitpricefromaudit: struct (nullable = true) | | |-- currency: string (nullable = true) | | |-- amount: string (nullable = true) | |-- actualperunitprice: struct (nullable = true) | | |-- currency: string (nullable = true) | | |-- amount: string (nullable = true) | |-- category: string (nullable = true) | |-- itemtype: string (nullable = true) | |-- roomamenityid: long (nullable = true) |-- createddate: struct (nullable = true) | |-- date: long (nullable = true) |-- actualunits: long (nullable = true) |-- description: string (nullable = true) >>> 
inc_df.printSchema() root |-- _id: struct (nullable = true) | |-- oid: string (nullable = true) |-- _class: string (nullable = true) |-- roomid: string (nullable = true) |-- item: struct (nullable = true) | |-- name: string (nullable = true) | |-- brandname: string (nullable = true) | |-- perunitpricefromaudit: struct (nullable = true) | | |-- currency: string (nullable = true) | | |-- amount: string (nullable = true) | |-- actualperunitprice: struct (nullable = true) | | |-- currency: string (nullable = true) | | |-- amount: string (nullable = true) | |-- category: string (nullable = true) | |-- itemtype: string (nullable = true) | |-- roomamenityid: long (nullable = true) |-- inserted_at: string (nullable = true) {code} NOTE: It is best to show us a simpler query that reproduces the issue. If possible, please remove unnecessary parts, e.g., columns, to save developers' time.
[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174263#comment-17174263 ] Takeshi Yamamuro commented on SPARK-29767: -- I checked the given query and I think your union query looks incorrect. All the branches (master/branch-3.0/branch-2.4) have the same error; {code:java} : org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 5 columns and the second table has 20 columns;; {code} Actually, the schemas are different. NOTE: It is best to show us a simpler query that reproduces the issue. If possible, please remove unnecessary parts, e.g., columns, to save developers' time. > Core dump happening on executors while doing simple union of Data Frames > > > Key: SPARK-29767 > URL: https://issues.apache.org/jira/browse/SPARK-29767 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core > Affects Versions: 2.4.4 > Environment: AWS EMR 5.27.0, Spark 2.4.4 > Reporter: Udit Mehrotra > Priority: Major > Attachments: coredump.zip, hs_err_pid13885.log, part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet, test.py > > Running a union operation on two DataFrames through both the Scala Spark shell and PySpark results in executor containers doing a *core dump* and exiting with exit code 134. > The trace from the *Driver*: > {noformat} > Container exited with a non-zero exit code 134 > .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; > aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure > (executor 11 exited caused by one of the running tasks) Reason: Container > from a bad node: container_1572981097605_0021_01_77 on host: > ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_1572981097605_0021_01_77 > Exit code: 134 > Exception message: /bin/bash: line 1: 12611 Aborted > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" >
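The analysis of this issue above shows the failure surfacing as an AnalysisException because Union requires both sides to have the same number of columns. That precondition can be sketched outside Spark (plain Java; `checkUnion` is an illustrative helper, not Spark's analyzer code):

```java
import java.util.Arrays;
import java.util.List;

public class UnionCheck {

    // Mirrors the analyzer rule behind the exception quoted above: UNION branches
    // must have the same number of columns (columns are matched by position,
    // not by name, so differing schemas with equal widths would still "work").
    public static void checkUnion(List<String> leftCols, List<String> rightCols) {
        if (leftCols.size() != rightCols.size()) {
            throw new IllegalArgumentException(
                "Union can only be performed on tables with the same number of columns, "
                + "but the first table has " + leftCols.size()
                + " columns and the second table has " + rightCols.size() + " columns");
        }
    }

    public static void main(String[] args) {
        checkUnion(Arrays.asList("a", "b"), Arrays.asList("x", "y")); // ok: same width
        try {
            checkUnion(Arrays.asList("a"), Arrays.asList("x", "y"));  // mismatched width
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

For DataFrames whose columns agree by name but not by position or width, Spark's `Dataset.unionByName` (available since 2.3) resolves columns by name instead and is usually the safer call for schemas like the ones quoted above.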
[jira] [Created] (SPARK-32580) Issue accessing a column values after 'explode' function
Ayrat Sadreev created SPARK-32580: - Summary: Issue accessing a column values after 'explode' function Key: SPARK-32580 URL: https://issues.apache.org/jira/browse/SPARK-32580 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 3.0.0 Reporter: Ayrat Sadreev An exception occurs when trying to flatten double-nested arrays. The schema is
{code:none}
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- timestamp: string (nullable = true)
 |    |    |-- values: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- sample: double (nullable = true)
{code}
The target schema is
{code:none}
root
 |-- item_id: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- sample: double (nullable = true)
{code}
The code (in Java)
{code:java}
package com.skf.streamer.spark;

import java.util.concurrent.TimeoutException;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ExplodeTest {

    public static void main(String[] args) throws TimeoutException {
        SparkConf conf = new SparkConf()
            .setAppName("SimpleApp")
            .set("spark.scheduler.mode", "FAIR")
            .set("spark.master", "local[1]")
            .set("spark.sql.streaming.checkpointLocation", "checkpoint");
        SparkSession spark = SparkSession.builder()
            .config(conf)
            .getOrCreate();
        Dataset<Row> d0 = spark
            .read()
            .format("json")
            .option("multiLine", "true")
            .schema(getSchema())
            .load("src/test/resources/explode/data.json");
        d0.printSchema();
        d0 = d0.withColumn("item", functions.explode(d0.col("data")));
        d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
        d0.printSchema();
        d0 = d0.select(
            d0.col("item.item_id"),
            d0.col("item.timestamp"),
            d0.col("value.sample")
        );
        d0.printSchema();
        d0.show(); // Fails
        spark.stop();
    }

    private static StructType getSchema() {
        StructField[] level2Fields = {
            DataTypes.createStructField("sample", DataTypes.DoubleType, false),
        };
        StructField[] level1Fields = {
            DataTypes.createStructField("item_id", DataTypes.StringType, false),
            DataTypes.createStructField("timestamp", DataTypes.StringType, false),
            DataTypes.createStructField("values", DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
        };
        StructField[] fields = {
            DataTypes.createStructField("data", DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
        };
        return DataTypes.createStructType(fields);
    }
}
{code}
The data file
{code:json}
{
  "data": [
    {
      "item_id": "item_1",
      "timestamp": "2020-07-01 12:34:89",
      "values": [
        { "sample": 1.1 },
        { "sample": 1.2 }
      ]
    },
    {
      "item_id": "item_2",
      "timestamp": "2020-07-02 12:34:89",
      "values": [
        { "sample": 2.2 }
      ]
    }
  ]
}
{code}
Dataset.show() method fails with an exception
{code:none}
Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in [_gen_alias_28#28,_gen_alias_29#29]
    at scala.sys.package$.error(package.scala:30)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    ... 37 more
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
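For reference, the flattening the reporter is after (one output row per nested value) is a nested flat-map. A Spark-free sketch of the expected result (plain Java; the `Item`/`Value` shapes mirror the JSON in the report above and are illustrative only, not Spark types):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExplodeSketch {

    // Mirrors one element of the "values" array in the report's JSON.
    public static final class Value {
        final double sample;
        public Value(double sample) { this.sample = sample; }
    }

    // Mirrors one element of the "data" array in the report's JSON.
    public static final class Item {
        final String itemId;
        final String timestamp;
        final List<Value> values;
        public Item(String itemId, String timestamp, List<Value> values) {
            this.itemId = itemId;
            this.timestamp = timestamp;
            this.values = values;
        }
    }

    // What the two explode() calls plus the final select() should produce:
    // one (item_id, timestamp, sample) row per nested value.
    public static List<String> flatten(List<Item> data) {
        List<String> rows = new ArrayList<>();
        for (Item item : data) {
            for (Value v : item.values) {
                rows.add(item.itemId + "," + item.timestamp + "," + v.sample);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Item> data = Arrays.asList(
            new Item("item_1", "2020-07-01 12:34:89", Arrays.asList(new Value(1.1), new Value(1.2))),
            new Item("item_2", "2020-07-02 12:34:89", Arrays.asList(new Value(2.2))));
        for (String row : flatten(data)) {
            System.out.println(row); // three rows, as the target schema implies
        }
    }
}
```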
[jira] [Updated] (SPARK-32290) NotInSubquery SingleColumn Optimize
[ https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-32290: Priority: Major (was: Minor) > NotInSubquery SingleColumn Optimize > --- > > Key: SPARK-32290 > URL: https://issues.apache.org/jira/browse/SPARK-32290 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Leanken.Lin > Assignee: Leanken.Lin > Priority: Major > Fix For: 3.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Normally, a NOT IN subquery is planned as a BroadcastNestedLoopJoinExec, which is very time-consuming. For example, in a recent TPCH benchmark, Query 16 took almost half of the entire 22-query execution time. So I propose the following optimization. > Inside BroadcastNestedLoopJoinExec, we can identify a single-column NOT IN subquery by the following pattern: > {code:java} > case _@Or( > _@EqualTo(leftAttr: AttributeReference, rightAttr: AttributeReference), > _@IsNull( > _@EqualTo(_: AttributeReference, _: AttributeReference) > ) > ) > {code} > If the build side is small enough, we can turn its rows into a HashMap, so the M*N computation is reduced to M*log(N). > I ran a benchmark on 1TB TPCH: before the optimization, Query 16 took around 18 minutes; after applying the M*log(N) optimization, it finishes in only 30s. > This optimization only works for single-column NOT IN subqueries, so I am seeking advice on whether the community needs this change. I will open a pull request first; if the community thinks it is too hacky, it is fine to just ignore this request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
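The M*N-to-M*log(N) idea described above can be sketched outside Spark: instead of rescanning the whole build side for every streamed row, hash the build side once and probe it per row, while preserving NOT IN's null semantics. A minimal sketch (plain Java; names are illustrative, not the internals of BroadcastNestedLoopJoinExec; a HashSet probe is expected O(1), which still illustrates the beat-M*N point):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NotInSketch {

    // Single-column NOT IN semantics: if the build side contains any null,
    // no streamed row can be proven "not in" it, so the result is empty;
    // a null streamed value is likewise dropped. Otherwise, hash the build
    // side once (O(N)) and probe it per streamed row, instead of the nested
    // loop's full rescan per row (M*N).
    public static List<Integer> notIn(List<Integer> streamed, List<Integer> build) {
        if (build.contains(null)) {
            return Collections.emptyList();
        }
        Set<Integer> buildSet = new HashSet<>(build);
        List<Integer> out = new ArrayList<>();
        for (Integer row : streamed) {
            if (row != null && !buildSet.contains(row)) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(notIn(Arrays.asList(1, 2, 3), Arrays.asList(2)));       // [1, 3]
        System.out.println(notIn(Arrays.asList(1, 2, 3), Arrays.asList(2, null))); // []
    }
}
```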
[jira] [Updated] (SPARK-32409) Document the dependency between spark.metrics.staticSources.enabled and JVMSource registration
[ https://issues.apache.org/jira/browse/SPARK-32409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-32409: Summary: Document the dependency between spark.metrics.staticSources.enabled and JVMSource registration (was: Remove dependency between spark.metrics.staticSources.enabled and JVMSource registration) > Document the dependency between spark.metrics.staticSources.enabled and > JVMSource registration > -- > > Key: SPARK-32409 > URL: https://issues.apache.org/jira/browse/SPARK-32409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > SPARK-29654 has introduced the configuration > `spark.metrics.register.static.sources.enabled`. > The current implementation of SPARK-29654 and SPARK-25277 have, as side > effect, that when {{spark.metrics.register.static.sources.enabled}} is set to > false, the registration of JVM Source is also ignored, even if requested with > {{spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource}} > A PR is proposed to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
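For reference, the surprising combination being documented looks like this in a user's metrics configuration (property keys copied from the ticket text; treat the exact spelling as illustrative):

```text
# The metrics config requests the JVM source explicitly...
spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
# ...but disabling static sources currently also skips that JvmSource registration
spark.metrics.staticSources.enabled=false
```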
[jira] [Updated] (SPARK-32573) Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys
[ https://issues.apache.org/jira/browse/SPARK-32573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32573: Description:
In SPARK-32290, we introduced several new types of HashedRelation:
* EmptyHashedRelation
* EmptyHashedRelationWithAllNullKeys
They were limited to use only in the NAAJ scenario, but these new HashedRelations can be applied to other scenarios for performance improvements:
* EmptyHashedRelation can also be used in a normal anti join for a fast stop.
* When AQE is on and the build side is EmptyHashedRelationWithAllNullKeys, we can convert the NAAJ into an empty LocalRelation to skip meaningless data iteration, since in single-key NAAJ a null key on the build side drops all records on the streamed side.
This patch includes two changes:
* use EmptyHashedRelation to do a fast stop for the common anti join as well
* in AQE, eliminate the BroadcastHashJoin (NAAJ) if the build side is an EmptyHashedRelationWithAllNullKeys

was:
In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we introduced several new types of HashedRelation:
* EmptyHashedRelation
* EmptyHashedRelationWithAllNullKeys
They were limited to use only in the NAAJ scenario. But as an improvement, EmptyHashedRelation can also be used in a normal anti join for a fast stop, and in AQE we can even eliminate the anti join when we know the build side is empty.
This patch includes two changes:
In non-AQE, use EmptyHashedRelation to do a fast stop for the common anti join as well.
In AQE, eliminate the anti join if the build side is an EmptyHashedRelation or ShuffleWriteRecord is 0.

Summary: Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys (was: Eliminate Anti Join when BuildSide is Empty)

> Anti Join Improvement with EmptyHashedRelation and
> EmptyHashedRelationWithAllNullKeys
> -
>
> Key: SPARK-32573
> URL: https://issues.apache.org/jira/browse/SPARK-32573
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Leanken.Lin
> Priority: Minor
>
> In SPARK-32290, we introduced several new types of HashedRelation:
> * EmptyHashedRelation
> * EmptyHashedRelationWithAllNullKeys
> They were limited to use only in the NAAJ scenario, but these new HashedRelations can be applied to other scenarios for performance improvements:
> * EmptyHashedRelation can also be used in a normal anti join for a fast stop.
> * When AQE is on and the build side is EmptyHashedRelationWithAllNullKeys, we can convert the NAAJ into an empty LocalRelation to skip meaningless data iteration, since in single-key NAAJ a null key on the build side drops all records on the streamed side.
> This patch includes two changes:
> * use EmptyHashedRelation to do a fast stop for the common anti join as well
> * in AQE, eliminate the BroadcastHashJoin (NAAJ) if the build side is an EmptyHashedRelationWithAllNullKeys
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
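The two fast paths can be modeled in a few lines of plain Java (a schematic, hedged illustration; HashedRelation and LocalRelation in Spark are of course different APIs): an empty build side means a regular anti join keeps every streamed row, while a build side whose only keys are null makes a null-aware anti join (NAAJ) return nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class AntiJoinFastStop {
    // buildKeys holds the non-null build-side keys; buildHasNullKey records
    // whether the build side contained a null key. The
    // EmptyHashedRelationWithAllNullKeys case is buildKeys.isEmpty() && buildHasNullKey.
    static List<Integer> antiJoin(List<Integer> streamed, Set<Integer> buildKeys,
                                  boolean nullAware, boolean buildHasNullKey) {
        if (nullAware && buildHasNullKey) {
            return new ArrayList<>();            // NAAJ + null key on build side: empty result
        }
        if (buildKeys.isEmpty() && !buildHasNullKey) {
            return new ArrayList<>(streamed);    // empty build side: fast stop, keep everything
        }
        List<Integer> out = new ArrayList<>();   // otherwise, do the normal anti-join probe
        for (Integer k : streamed) {
            if (!buildKeys.contains(k)) out.add(k);
        }
        return out;
    }
}
```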
[jira] [Commented] (SPARK-32576) Support PostgreSQL `bpchar` type and array of char type
[ https://issues.apache.org/jira/browse/SPARK-32576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174152#comment-17174152 ] Apache Spark commented on SPARK-32576: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/29397 > Support PostgreSQL `bpchar` type and array of char type > --- > > Key: SPARK-32576 > URL: https://issues.apache.org/jira/browse/SPARK-32576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jakub Korzeniowski >Assignee: Jakub Korzeniowski >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > Attempting to read the following table: > {code:java} > CREATE TABLE test_table ( > test_column char(64)[] > ) > {code} > results in the following exception: > {code:java} > java.sql.SQLException: Unsupported type ARRAY > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:256) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203) > {code} > non-array 
and varchar equivalents are fine. > > I've tracked it down to an internal function of the Postgres dialect that > accounts for the special 1-byte char but doesn't deal with different-length > ones, which Postgres represents as bpchar: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L60-L61]. > Relevant driver code can be found here: > [https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/TypeInfoCache.java#L85-L87] > > I'll submit a fix shortly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
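The shape of the fix is a name-based mapping in the JDBC dialect: the driver reports char(n) columns (and char(n)[] element types) under the internal name bpchar, so bpchar must map to the string type just like varchar. A hedged, self-contained sketch of that mapping (illustrative names and type subset; the real code lives in PostgresDialect.getCatalystType):

```java
public class PgTypeNames {
    // Map a PostgreSQL type name, as reported by the JDBC driver, to a
    // Spark-like catalyst type name. Returns null for unsupported names,
    // which is where "Unsupported type ..." errors originate.
    static String toCatalystTypeName(String pgTypeName) {
        switch (pgTypeName) {
            case "text":
            case "varchar":
            case "char":
            case "bpchar":   // blank-padded char(n); the previously unhandled case
                return "StringType";
            case "int4":
                return "IntegerType";
            case "float8":
                return "DoubleType";
            default:
                return null;
        }
    }
}
```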
[jira] [Commented] (SPARK-32579) Implement JDBCScan/ScanBuilder/WriteBuilder
[ https://issues.apache.org/jira/browse/SPARK-32579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174129#comment-17174129 ] Apache Spark commented on SPARK-32579: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/29396 > Implement JDBCScan/ScanBuilder/WriteBuilder > --- > > Key: SPARK-32579 > URL: https://issues.apache.org/jira/browse/SPARK-32579 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add JDBCScan, JDBCScanBuilder and JDBCWriteBuilder to Datasource V2 JDBC > implementation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32579) Implement JDBCScan/ScanBuilder/WriteBuilder
[ https://issues.apache.org/jira/browse/SPARK-32579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32579: Assignee: Apache Spark > Implement JDBCScan/ScanBuilder/WriteBuilder > --- > > Key: SPARK-32579 > URL: https://issues.apache.org/jira/browse/SPARK-32579 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > Add JDBCScan, JDBCScanBuilder and JDBCWriteBuilder to Datasource V2 JDBC > implementation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32579) Implement JDBCScan/ScanBuilder/WriteBuilder
[ https://issues.apache.org/jira/browse/SPARK-32579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32579: Assignee: (was: Apache Spark) > Implement JDBCScan/ScanBuilder/WriteBuilder > --- > > Key: SPARK-32579 > URL: https://issues.apache.org/jira/browse/SPARK-32579 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add JDBCScan, JDBCScanBuilder and JDBCWriteBuilder to Datasource V2 JDBC > implementation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32579) Implement JDBCScan/ScanBuilder/WriteBuilder
Huaxin Gao created SPARK-32579: -- Summary: Implement JDBCScan/ScanBuilder/WriteBuilder Key: SPARK-32579 URL: https://issues.apache.org/jira/browse/SPARK-32579 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Huaxin Gao Add JDBCScan, JDBCScanBuilder and JDBCWriteBuilder to Datasource V2 JDBC implementation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32518) CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources
[ https://issues.apache.org/jira/browse/SPARK-32518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174113#comment-17174113 ] Apache Spark commented on SPARK-32518: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/29395 > CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds > of resources > -- > > Key: SPARK-32518 > URL: https://issues.apache.org/jira/browse/SPARK-32518 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > Currently, CoarseGrainedSchedulerBackend.maxNumConcurrentTasks only considers > CPU when computing the max concurrent tasks. This can cause the application to hang > when a barrier stage requires extra custom resources but the cluster doesn't > have enough of them: without a check for the other custom resources in > maxNumConcurrentTasks, the barrier stage can be submitted to TaskSchedulerImpl, > but TaskSchedulerImpl cannot launch tasks for the barrier stage due to the > insufficient task slots calculated by calculateAvailableSlots (which does check > all kinds of resources). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
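The fix described above amounts to taking the minimum slot count over every declared resource rather than over CPU alone. A small stand-alone model of that calculation (hypothetical method and resource names; Spark's calculateAvailableSlots is more involved):

```java
import java.util.Map;

public class SlotCalc {
    // Max concurrent tasks = min over all required resources of
    // floor(available / amountPerTask). A resource the cluster lacks
    // entirely contributes 0 slots.
    static int maxConcurrentTasks(Map<String, Long> available, Map<String, Long> perTask) {
        long slots = Long.MAX_VALUE;
        for (Map.Entry<String, Long> req : perTask.entrySet()) {
            long have = available.getOrDefault(req.getKey(), 0L);
            slots = Math.min(slots, have / req.getValue());
        }
        return (int) Math.min(slots, Integer.MAX_VALUE);
    }
}
```

For example, 16 CPUs at 4 per task but only 2 GPUs at 1 per task gives min(4, 2) = 2 slots; a CPU-only check would report 4, letting a 3-task barrier stage be submitted and then hang.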
[jira] [Resolved] (SPARK-32191) Migration Guide
[ https://issues.apache.org/jira/browse/SPARK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32191. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29385 [https://github.com/apache/spark/pull/29385] > Migration Guide > --- > > Key: SPARK-32191 > URL: https://issues.apache.org/jira/browse/SPARK-32191 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > Port http://spark.apache.org/docs/latest/pyspark-migration-guide.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32191) Migration Guide
[ https://issues.apache.org/jira/browse/SPARK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32191: Assignee: L. C. Hsieh > Migration Guide > --- > > Key: SPARK-32191 > URL: https://issues.apache.org/jira/browse/SPARK-32191 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: L. C. Hsieh >Priority: Major > > Port http://spark.apache.org/docs/latest/pyspark-migration-guide.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32578) PageRank not sending the correct values in Pregel sendMessage
[ https://issues.apache.org/jira/browse/SPARK-32578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shay Elbaz updated SPARK-32578: --- Description: The core sendMessage method is incorrect: {code:java} def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = { if (edge.srcAttr._2 > tol) { Iterator((edge.dstId, edge.srcAttr._2 * edge.attr)) // *** THIS ^ *** } else { Iterator.empty } }{code} Instead of using the source PR value, it uses the PR delta (the tuple's second element). This is not the documented behavior, nor a valid PR algorithm AFAIK. This code is 7 years old; all versions are affected. was: The core sendMessage method is incorrect: {code:java} def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = { if (edge.srcAttr._2 > tol) { Iterator((edge.dstId, edge.srcAttr._2 * edge.attr)) // THIS ^ } else { Iterator.empty } }{code} Instead of sending the source PR value, it sends the PR delta. This is not the documented behavior, nor a valid PR algorithm AFAIK. This code is 7 years old; all versions are affected. > PageRank not sending the correct values in Pregel sendMessage > - > > Key: SPARK-32578 > URL: https://issues.apache.org/jira/browse/SPARK-32578 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: Shay Elbaz >Priority: Major > > The core sendMessage method is incorrect: > {code:java} > def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = { > if (edge.srcAttr._2 > tol) { >Iterator((edge.dstId, edge.srcAttr._2 * edge.attr)) > // *** THIS ^ *** > } else { >Iterator.empty > } > }{code} > > Instead of using the source PR value, it uses the PR delta (the tuple's second element). This is not the documented behavior, nor a valid PR algorithm AFAIK. > This code is 7 years old; all versions are affected. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
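As a neutral reference point, the textbook Pregel-style PageRank step sends each source's rank scaled by the out-edge weight, and a vertex folds its incoming messages as resetProb + (1 - resetProb) * sum. The ticket argues GraphX sends the delta (the tuple's second element) where a rank is expected; the sketch below shows only the textbook arithmetic, not GraphX's actual implementation:

```java
public class PageRankStep {
    // Message sent along an out-edge in a Pregel-style PageRank iteration:
    // the source's current rank scaled by the (normalized) edge weight.
    static double message(double srcRank, double edgeWeight) {
        return srcRank * edgeWeight;
    }

    // New rank of a vertex given the messages it received this superstep.
    static double newRank(double[] incoming, double resetProb) {
        double sum = 0.0;
        for (double m : incoming) sum += m;
        return resetProb + (1.0 - resetProb) * sum;
    }
}
```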
[jira] [Created] (SPARK-32578) PageRank not sending the correct values in Pregel sendMessage
Shay Elbaz created SPARK-32578: -- Summary: PageRank not sending the correct values in Pregel sendMessage Key: SPARK-32578 URL: https://issues.apache.org/jira/browse/SPARK-32578 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 3.0.0, 2.4.0, 2.3.0 Reporter: Shay Elbaz The core sendMessage method is incorrect: {code:java} def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = { if (edge.srcAttr._2 > tol) { Iterator((edge.dstId, edge.srcAttr._2 * edge.attr)) // THIS ^ } else { Iterator.empty } }{code} Instead of sending the source PR value, it sends the PR delta. This is not the documented behavior, nor a valid PR algorithm AFAIK. This code is 7 years old; all versions are affected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32337) Show initial plan in AQE plan tree string
[ https://issues.apache.org/jira/browse/SPARK-32337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32337: --- Assignee: Allison Wang > Show initial plan in AQE plan tree string > - > > Key: SPARK-32337 > URL: https://issues.apache.org/jira/browse/SPARK-32337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Minor > Fix For: 3.1.0 > > > Currently the tree string for {{AdaptiveSparkPlanExec}} only shows the > current physical plan. It would be helpful to also show the initial plan to > see the plan change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org