[jira] [Assigned] (SPARK-30532) DataFrameStatFunctions.approxQuantile doesn't work with TABLE.COLUMN syntax
[ https://issues.apache.org/jira/browse/SPARK-30532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-30532:
-----------------------------------
    Assignee: Oleksii Kachaiev

> DataFrameStatFunctions.approxQuantile doesn't work with TABLE.COLUMN syntax
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-30532
>                 URL: https://issues.apache.org/jira/browse/SPARK-30532
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Chris Suchanek
>            Assignee: Oleksii Kachaiev
>            Priority: Minor
>             Fix For: 3.0.0
>
> DataFrameStatFunctions.approxQuantile doesn't work with a fully qualified
> column name (i.e. TABLE_NAME.COLUMN_NAME), which is often the way you refer
> to a column when working with joined dataframes that have ambiguous column
> names. See the code below for an example.
> {code:java}
> import scala.util.Random
> val l = (0 to 1000).map(_ => Random.nextGaussian() * 1000)
> val df1 = sc.parallelize(l).toDF("num").as("tt1")
> val df2 = sc.parallelize(l).toDF("num").as("tt2")
> val dfx = df2.crossJoin(df1)
> dfx.stat.approxQuantile("tt1.num", Array(0.1), 0.0)
> // throws: java.lang.IllegalArgumentException: Field "tt1.num" does not exist.
> // Available fields: num
> dfx.stat.approxQuantile("num", Array(0.1), 0.0)
> // throws: org.apache.spark.sql.AnalysisException: Reference 'num' is
> // ambiguous, could be: tt2.num, tt1.num.;
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
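The report above pins relativeError to 0.0, which requests exact quantiles. As a hedged, Spark-free illustration of the semantics approxQuantile targets in that case (plain Python; the `exact_quantile` helper is hypothetical, and Spark's actual implementation uses the approximate Greenwald-Khanna algorithm rather than a full sort), the lookup amounts to:

```python
# Exact quantile lookup, illustrating what approxQuantile computes when
# relativeError = 0.0. Spark only approximates this for nonzero error
# bounds; this sketch is plain Python, no Spark needed.

def exact_quantile(values, prob):
    """Return the smallest element v such that at least prob * n
    elements are <= v (the definition approxQuantile targets)."""
    assert 0.0 <= prob <= 1.0
    ordered = sorted(values)
    n = len(ordered)
    # rank of the target element, clamped to a valid index
    rank = min(max(int(prob * n), 1), n) - 1
    return ordered[rank]

data = list(range(1, 1001))          # 1..1000
print(exact_quantile(data, 0.1))     # 100
print(exact_quantile(data, 0.5))     # 500
```

As for the bug itself, a common workaround until the fix is to select the qualified column into a fresh, unambiguous name first, e.g. `dfx.select($"tt1.num".as("num1"))`, and then call `approxQuantile("num1", ...)` on the result.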
[jira] [Resolved] (SPARK-30532) DataFrameStatFunctions.approxQuantile doesn't work with TABLE.COLUMN syntax
[ https://issues.apache.org/jira/browse/SPARK-30532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-30532.
---------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27916
[https://github.com/apache/spark/pull/27916]
[jira] [Assigned] (SPARK-31283) Simplify ChiSq by adding a common method
[ https://issues.apache.org/jira/browse/SPARK-31283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng reassigned SPARK-31283:
------------------------------------
    Assignee: zhengruifeng

> Simplify ChiSq by adding a common method
> ----------------------------------------
>
>                 Key: SPARK-31283
>                 URL: https://issues.apache.org/jira/browse/SPARK-31283
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> The logic in '{color:#c7a65d}chiSquaredDenseFeatures{color}' and
> '{color:#c7a65d}chiSquaredSparseFeatures{color}' can be unified.
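The refactor is internal to Spark ML, but its shape can be sketched outside Spark: both the dense and the sparse entry points normalize their input and delegate to one shared routine. The sketch below is purely illustrative (the names `chi2_stat`, `chi2_dense`, and `chi2_sparse` are hypothetical, not Spark's), computing a Pearson chi-squared statistic:

```python
# Hypothetical sketch of the "common method" refactor: both the dense and
# the sparse entry points reduce their input to (observed, expected) pairs
# and delegate to one shared chi-squared routine.

def chi2_stat(observed, expected):
    """Shared core: Pearson chi-squared statistic sum((O - E)^2 / E)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_dense(counts, expected):
    # dense path: counts is a plain list aligned with expected
    return chi2_stat(counts, expected)

def chi2_sparse(counts, size, expected):
    # sparse path: counts is {index: count}; missing indices are zero
    dense = [counts.get(i, 0) for i in range(size)]
    return chi2_stat(dense, expected)

obs = [16, 18, 16, 14, 12, 12]
exp = [16, 16, 16, 16, 16, 8]
print(chi2_dense(obs, exp))                                             # 3.5
print(chi2_sparse({0: 16, 1: 18, 2: 16, 3: 14, 4: 12, 5: 12}, 6, exp))  # 3.5
```

Both paths produce the same statistic, which is the point of the ticket: the duplicated logic collapses into one method once the feature representation is normalized.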
[jira] [Resolved] (SPARK-31283) Simplify ChiSq by adding a common method
[ https://issues.apache.org/jira/browse/SPARK-31283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng resolved SPARK-31283.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 28045
[https://github.com/apache/spark/pull/28045]
[jira] [Resolved] (SPARK-31286) Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-31286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-31286.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 28051
[https://github.com/apache/spark/pull/28051]

> Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-31286
>                 URL: https://issues.apache.org/jira/browse/SPARK-31286
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 3.0.0
>
> There are two distinct types of ID (see
> https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html):
> # Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the
>   same offset for all local date-times
> # Geographical regions - an area where a specific set of rules for finding
>   the offset from UTC/Greenwich apply
> For example, three-letter time zone IDs are ambiguous and depend on the
> locale. They have already been deprecated in the JDK, see
> https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html :
> {code}
> For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such
> as "PST", "CTT", "AST") are also supported. However, their use is deprecated
> because the same abbreviation is often used for multiple time zones (for
> example, "CST" could be U.S. "Central Standard Time" and "China Standard
> Time"), and the Java platform can then only recognize one of them.
> {code}
> The ticket aims to specify formats of the `timeZone` option in the JSON/CSV
> datasource, and the `tz` parameter of the from_utc_timestamp() and
> to_utc_timestamp() functions.
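Spark resolves these IDs through java.time.ZoneId; the same two-way distinction can be illustrated with Python's standard-library zoneinfo (an analogous illustration of the concept, not Spark's code path):

```python
# The two kinds of zone ID, illustrated with Python's stdlib (Spark itself
# resolves IDs via java.time.ZoneId, but the distinction is the same):
#   1. fixed offsets: one offset for every local date-time
#   2. geographical regions: offset rules that vary over the year (DST)
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

fixed = timezone(timedelta(hours=-8))      # fixed offset, "-08:00"
region = ZoneInfo("America/Los_Angeles")   # geographical region

winter = datetime(2020, 1, 15, 12, 0)
summer = datetime(2020, 7, 15, 12, 0)

# The fixed offset never changes...
print(winter.replace(tzinfo=fixed).utcoffset())   # always -08:00
print(summer.replace(tzinfo=fixed).utcoffset())   # same

# ...while the region follows its DST rules.
print(winter.replace(tzinfo=region).utcoffset())  # -08:00 (PST)
print(summer.replace(tzinfo=region).utcoffset())  # -07:00 (PDT)
```

This is also why a region ID is the safer choice for user data: a fixed offset silently misplaces timestamps that fall in the other half of the DST year.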
[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070688#comment-17070688 ]

Reynold Xin commented on SPARK-22231:
-------------------------------------

[~fqaiser94] thanks for your persistence and my apologies for the delay. You have my buy-in. This is a great idea.

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-22231
>                 URL: https://issues.apache.org/jira/browse/SPARK-22231
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: DB Tsai
>            Assignee: Jeremy Smith
>            Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great
> content to fulfill the unique tastes of our members. Before building
> recommendation algorithms, we need to prepare the training, testing, and
> validation datasets in Apache Spark. Due to the nature of ranking problems,
> we have a nested list of items to be ranked in one column, and the top level
> is the contexts describing the setting for where a model is to be used (e.g.
> profiles, country, time, device, etc.). Here is a blog post describing the
> details: [Distributed Time Travel for Feature
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
> To be more concrete, for the ranks of videos for a given profile_id at a
> given country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- title_id: integer (nullable = true)
>  |    |    |-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some
> functions on them. Sometimes, we're dropping or adding new columns in the
> nested list of structs. Currently, there is no easy solution in open source
> Apache Spark to perform those operations using SQL primitives; many people
> just convert the data into an RDD to work on the nested level of data, and
> then reconstruct the new dataframe as a workaround. This is extremely
> inefficient because all the optimizations like predicate pushdown in SQL
> cannot be performed, we cannot leverage the columnar format, and the
> serialization and deserialization cost becomes really huge even if we just
> want to add a new column in the nested level.
> We built a solution internally at Netflix which we're very happy with. We
> plan to make it open source in Spark upstream. We would like to socialize the
> API design to see if we missed any use-case.
> The first API we added is *mapItems* on dataframe, which takes a function
> from *Column* to *Column* and applies it on the nested dataframe. Here is an
> example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability of applying a function in the nested dataframe, we can
> add a new function, *withColumn* in *Column*, to add or replace the existing
> column that has the same name in the nested list of struct. Here are two
> examples demonstrating the API together with *mapItems*; the first one
> replaces the existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: struct (containsNull = true)
> // |    |    |-- a: integer (nullable = true)
> // |    |    |-- b: double (nullable = true)
> result.show(false)
> // +---+----+
> // |foo|bar |items |
> //
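The proposed mapItems cannot be demonstrated outside Spark, but its intended semantics, applying a function to each element of a nested array column while leaving sibling columns untouched, can be sketched on plain Python rows (`map_items` below is a hypothetical stand-in, not the proposed Scala API):

```python
# Spark-free sketch of the proposed mapItems semantics: apply a function to
# every element of a nested "array column", copying the other columns as-is.

def map_items(rows, column, fn):
    """Return new rows with fn applied to every element of rows[column]."""
    return [{**row, column: [fn(item) for item in row[column]]} for row in rows]

rows = [
    {"foo": 10, "bar": 10.0, "items": [10.1, 10.2, 10.3, 10.4]},
    {"foo": 20, "bar": 20.0, "items": [20.1, 20.2, 20.3, 20.4]},
]

# First example from the ticket: double every element of "items"
doubled = map_items(rows, "items", lambda x: x * 2.0)
print(doubled[0]["items"])   # [20.2, 20.4, 20.6, 20.8]

# Second example: the nested withColumn, i.e. replace field "b" in each struct
structs = [{"foo": 10, "items": [{"a": 10, "b": 10.0}, {"a": 11, "b": 11.0}]}]
bumped = map_items(structs, "items", lambda s: {**s, "b": s["b"] + 1})
print(bumped[0]["items"])    # [{'a': 10, 'b': 11.0}, {'a': 11, 'b': 12.0}]
```

The point of the proposal is to get exactly this per-element rewrite without leaving the SQL/Column world, so Catalyst optimizations and the columnar format still apply.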
[jira] [Updated] (SPARK-31291) SQLQueryTestSuite: Avoid load test data if test case not uses them.
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiaan.geng updated SPARK-31291:
-------------------------------
    Summary: SQLQueryTestSuite: Avoid load test data if test case not uses them.  (was: SQLQueryTestSuiteAvoid load test data if test case not uses them)

> SQLQueryTestSuite: Avoid load test data if test case not uses them.
> -------------------------------------------------------------------
>
>                 Key: SPARK-31291
>                 URL: https://issues.apache.org/jira/browse/SPARK-31291
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL, Tests
>    Affects Versions: 3.1.0
>            Reporter: jiaan.geng
>            Priority: Minor
>
> SQLQueryTestSuite takes about 35 minutes to run. I checked the code and found
> that SQLQueryTestSuite loads test data repeatedly.
[jira] [Updated] (SPARK-31291) SQLQueryTestSuiteAvoid load test data if test case not uses them
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiaan.geng updated SPARK-31291:
-------------------------------
    Summary: SQLQueryTestSuiteAvoid load test data if test case not uses them  (was: Avoid load test data if test case not uses them)
[jira] [Updated] (SPARK-31291) Avoid load test data if test case not uses them
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31291:
----------------------------------
        Parent: SPARK-25604
    Issue Type: Sub-task  (was: Improvement)
[jira] [Updated] (SPARK-31291) Avoid load test data if test case not uses them
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31291:
----------------------------------
    Priority: Minor  (was: Major)
[jira] [Updated] (SPARK-31291) Avoid load test data if test case not uses them
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31291:
----------------------------------
    Component/s: Tests
[jira] [Updated] (SPARK-31101) Upgrade Janino to 3.0.16
[ https://issues.apache.org/jira/browse/SPARK-31101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31101:
----------------------------------
    Fix Version/s: (was: 3.1.0)

> Upgrade Janino to 3.0.16
> -------------------------
>
>                 Key: SPARK-31101
>                 URL: https://issues.apache.org/jira/browse/SPARK-31101
>             Project: Spark
>          Issue Type: Dependency upgrade
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.0.0, 2.4.6
>
> We got reports of failures on user queries where Janino throws an error while
> compiling generated code. The issue is here: janino-compiler/janino#113. It
> contains the generated code, the symptom (error), and an analysis of the bug,
> so please refer to the link for more details.
> Janino 3.0.16 contains the PR janino-compiler/janino#114, which enables
> Janino to compile such queries properly.
[jira] [Updated] (SPARK-31101) Upgrade Janino to 3.0.16
[ https://issues.apache.org/jira/browse/SPARK-31101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31101:
----------------------------------
    Fix Version/s: 2.4.6
                   3.0.0
[jira] [Updated] (SPARK-31101) Upgrade Janino to 3.0.16
[ https://issues.apache.org/jira/browse/SPARK-31101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31101:
----------------------------------
    Issue Type: Bug  (was: Dependency upgrade)
[jira] [Updated] (SPARK-31101) Upgrade Janino to 3.0.16
[ https://issues.apache.org/jira/browse/SPARK-31101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31101:
----------------------------------
    Component/s:     (was: SQL)
                 Build
[jira] [Updated] (SPARK-31101) Upgrade Janino to 3.0.16
[ https://issues.apache.org/jira/browse/SPARK-31101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31101:
----------------------------------
    Affects Version/s:     (was: 3.1.0)
                       3.0.0
[jira] [Updated] (SPARK-31293) Fix wrong examples and help messages for Kinesis integration
[ https://issues.apache.org/jira/browse/SPARK-31293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31293:
----------------------------------
    Affects Version/s: 2.3.4

> Fix wrong examples and help messages for Kinesis integration
> -------------------------------------------------------------
>
>                 Key: SPARK-31293
>                 URL: https://issues.apache.org/jira/browse/SPARK-31293
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, DStreams
>    Affects Versions: 2.3.4, 2.4.5, 3.0.0
>            Reporter: Kengo Seki
>            Assignee: Kengo Seki
>            Priority: Minor
>             Fix For: 3.0.0, 2.4.6
>
> There are some minor mistakes in the examples and the help messages for
> Kinesis integration. For example, {{KinesisWordCountASL.scala}} takes three
> arguments but its example passes four, while {{kinesis_wordcount_asl.py}}
> takes four but its example passes three.
[jira] [Assigned] (SPARK-31293) Fix wrong examples and help messages for Kinesis integration
[ https://issues.apache.org/jira/browse/SPARK-31293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-31293:
-------------------------------------
    Assignee: Kengo Seki
[jira] [Resolved] (SPARK-31293) Fix wrong examples and help messages for Kinesis integration
[ https://issues.apache.org/jira/browse/SPARK-31293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-31293.
-----------------------------------
    Fix Version/s: 3.0.0
                   2.4.6
       Resolution: Fixed

Issue resolved by pull request 28063
[https://github.com/apache/spark/pull/28063]
[jira] [Updated] (SPARK-31293) Fix wrong examples and help messages for Kinesis integration
[ https://issues.apache.org/jira/browse/SPARK-31293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31293:
----------------------------------
    Affects Version/s: 2.4.5
[jira] [Updated] (SPARK-31293) Fix wrong examples and help messages for Kinesis integration
[ https://issues.apache.org/jira/browse/SPARK-31293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31293:
----------------------------------
    Issue Type: Bug  (was: Improvement)
[jira] [Comment Edited] (SPARK-31281) Hit OOM Error - GC Limit
[ https://issues.apache.org/jira/browse/SPARK-31281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070478#comment-17070478 ]

Alfred Davidson edited comment on SPARK-31281 at 3/29/20, 6:54 PM:
-------------------------------------------------------------------

The allocated driver memory will be split for storage, memoryOverhead, etc. Your transformation is doing a join (which is likely to be a broadcast join) and you have an action that is bringing the data to the driver - the driver doesn't have enough memory (and is initially trying to GC to free up space). You can either allocate more driver memory or change the fraction that it allocates for storage. I believe the default value is 0.6, i.e. it reserves 60% of driver memory for storage.

was (Author: alfiewdavidson):
The allocated driver memory will be split for storage, memoryOverhead, etc. As your action is bringing the data to the driver - the driver doesn't have enough memory (and is initially trying to GC to free up space). You can either allocate more driver memory or change the fraction that it allocates for storage.
I believe the default value is 0.6, i.e. it reserves 60% of driver memory for storage.

> Hit OOM Error - GC Limit
> ------------------------
>
>                 Key: SPARK-31281
>                 URL: https://issues.apache.org/jira/browse/SPARK-31281
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.4.4
>            Reporter: HongJin
>            Priority: Critical
>
> MemoryStore is 2.6GB
> conf = new SparkConf().setAppName("test")
>   //.set("spark.sql.codegen.wholeStage", "false")
>   .set("spark.driver.host", "localhost")
>   .set("spark.driver.memory", "4g")
>   .set("spark.executor.cores","1")
>   .set("spark.num.executors","1")
>   .set("spark.executor.memory", "4g")
>   .set("spark.executor.memoryOverhead", "400m")
>   .set("spark.dynamicAllocation.enabled", "true")
>   .set("spark.dynamicAllocation.minExecutors","1")
>   .set("spark.dynamicAllocation.maxExecutors","2")
>   .set("spark.ui.enabled","true") //enable spark UI
>   .set("spark.sql.shuffle.partitions",defaultPartitions)
>   .setMaster("local[2]")
> sparkSession = SparkSession.builder.config(conf).getOrCreate()
>
> val df = SparkFactory.sparkSession.sqlContext
>   .read
>   .option("header", "true")
>   .option("delimiter", delimiter)
>   .csv(textFileLocation)
>
> joinedDf = upperCaseLeft.as("l")
>   .join(upperCaseRight.as("r"), caseTransformedKeys, "full_outer")
>   .select(compositeKeysCol ::: nonKeyCols.map(col =>
>     mapHelper(col,toleranceValue,caseSensitive)): _*)
>
> data = joinedDf.take(maxRecords)
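The 0.6 figure the comment mentions corresponds to spark.memory.fraction in Spark's unified memory model: it is applied after a fixed reservation of roughly 300 MB of heap, and spark.memory.storageFraction (default 0.5) then marks the eviction-protected storage share within it. A back-of-the-envelope sketch under those assumed defaults (illustrative arithmetic, not a statement about this exact job):

```python
# Back-of-the-envelope for Spark's unified memory model, assuming defaults:
# ~300 MB reserved heap, spark.memory.fraction = 0.6,
# spark.memory.storageFraction = 0.5.

RESERVED_MB = 300          # fixed reservation inside the JVM heap
MEMORY_FRACTION = 0.6      # spark.memory.fraction (default)
STORAGE_FRACTION = 0.5     # spark.memory.storageFraction (default)

def unified_memory_mb(heap_mb):
    """Memory available to execution and storage combined."""
    return (heap_mb - RESERVED_MB) * MEMORY_FRACTION

def storage_region_mb(heap_mb):
    """Portion of unified memory that storage will not be evicted below."""
    return unified_memory_mb(heap_mb) * STORAGE_FRACTION

heap = 4 * 1024  # spark.driver.memory = 4g
print(round(unified_memory_mb(heap), 1))   # 2277.6 MB
print(round(storage_region_mb(heap), 1))   # 1138.8 MB
```

So of a 4 GB driver heap, only about 2.2 GB is available to execution plus storage, which is roughly the MemoryStore figure the reporter quotes; a `take(maxRecords)` over a full outer join has to fit its result inside that budget.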
[jira] [Comment Edited] (SPARK-31281) Hit OOM Error - GC Limit
[ https://issues.apache.org/jira/browse/SPARK-31281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070478#comment-17070478 ] Alfred Davidson edited comment on SPARK-31281 at 3/29/20, 6:54 PM: --- The allocated driver memory will be split for storage, memoryOverhead etc. You are executing a join (which is likely to be a broadcast join) and you have an action that is bringing the data to the driver - the driver doesn’t have enough memory (and initially trying to GC to free up space). You can either allocate more driver memory or change the fraction that it allocations for storage. I believe default value is 0.6 e.g reserves 60% of driver memory for storage was (Author: alfiewdavidson): The allocated driver memory will be split for storage, memoryOverhead etc. Your transformation is doing a join (which is likely to be a broadcast join) and you have an action that is bringing the data to the driver - the driver doesn’t have enough memory (and initially trying to GC to free up space). You can either allocate more driver memory or change the fraction that it allocations for storage. 
I believe default value is 0.6 e.g reserves 60% of driver memory for storage > Hit OOM Error - GC Limit > > > Key: SPARK-31281 > URL: https://issues.apache.org/jira/browse/SPARK-31281 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.4.4 >Reporter: HongJin >Priority: Critical > > MemoryStore is 2.6GB > conf = new SparkConf().setAppName("test") > //.set("spark.sql.codegen.wholeStage", "false") > .set("spark.driver.host", "localhost") > .set("spark.driver.memory", "4g") > .set("spark.executor.cores","1") > .set("spark.num.executors","1") > .set("spark.executor.memory", "4g") > .set("spark.executor.memoryOverhead", "400m") > .set("spark.dynamicAllocation.enabled", "true") > .set("spark.dynamicAllocation.minExecutors","1") > .set("spark.dynamicAllocation.maxExecutors","2") > .set("spark.ui.enabled","true") //enable spark UI > .set("spark.sql.shuffle.partitions",defaultPartitions) > .setMaster("local[2]") > sparkSession = SparkSession.builder.config(conf).getOrCreate() > > val df = SparkFactory.sparkSession.sqlContext > .read > .option("header", "true") > .option("delimiter", delimiter) > .csv(textFileLocation) > > joinedDf = upperCaseLeft.as("l") > .join(upperCaseRight.as("r"), caseTransformedKeys, "full_outer") > .select(compositeKeysCol ::: nonKeyCols.map(col => > mapHelper(col,toleranceValue,caseSensitive)): _*) > > data = joinedDf.take(maxRecords) > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31281) Hit OOM Error - GC Limit
[ https://issues.apache.org/jira/browse/SPARK-31281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070478#comment-17070478 ] Alfred Davidson commented on SPARK-31281: - The allocated driver memory will be split for storage, memoryOverhead etc. As your action is bringing the data to the driver, the driver doesn’t have enough memory (and is initially trying to GC to free up space). You can either allocate more driver memory or change the fraction that it allocates for storage. I believe the default value is 0.6, i.e. it reserves 60% of driver memory for storage > Hit OOM Error - GC Limit > > > Key: SPARK-31281 > URL: https://issues.apache.org/jira/browse/SPARK-31281 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.4.4 >Reporter: HongJin >Priority: Critical > > MemoryStore is 2.6GB > conf = new SparkConf().setAppName("test") > //.set("spark.sql.codegen.wholeStage", "false") > .set("spark.driver.host", "localhost") > .set("spark.driver.memory", "4g") > .set("spark.executor.cores","1") > .set("spark.num.executors","1") > .set("spark.executor.memory", "4g") > .set("spark.executor.memoryOverhead", "400m") > .set("spark.dynamicAllocation.enabled", "true") > .set("spark.dynamicAllocation.minExecutors","1") > .set("spark.dynamicAllocation.maxExecutors","2") > .set("spark.ui.enabled","true") //enable spark UI > .set("spark.sql.shuffle.partitions",defaultPartitions) > .setMaster("local[2]") > sparkSession = SparkSession.builder.config(conf).getOrCreate() > > val df = SparkFactory.sparkSession.sqlContext > .read > .option("header", "true") > .option("delimiter", delimiter) > .csv(textFileLocation) > > joinedDf = upperCaseLeft.as("l") > .join(upperCaseRight.as("r"), caseTransformedKeys, "full_outer") > .select(compositeKeysCol ::: nonKeyCols.map(col => > mapHelper(col,toleranceValue,caseSensitive)): _*) > > data = joinedDf.take(maxRecords) > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, 
e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
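To make the split concrete: Spark's unified memory pool is roughly `(heap - 300 MB reserved) * spark.memory.fraction`, with the fraction defaulting to 0.6, per Spark's UnifiedMemoryManager. A small sketch of that arithmetic (the formula and the 300 MB constant are recalled from Spark's source, not quoted from this thread):

```java
final class UnifiedMemorySketch {
    // Spark reserves a fixed chunk of the heap before applying the fraction
    // (300 MB in UnifiedMemoryManager; treat the exact constant as an assumption).
    static final long RESERVED_BYTES = 300L * 1024 * 1024;

    // Approximate size of the combined storage+execution pool:
    //   (heap - reserved) * spark.memory.fraction   (fraction defaults to 0.6)
    static long unifiedMemoryBytes(long heapBytes, double memoryFraction) {
        return (long) ((heapBytes - RESERVED_BYTES) * memoryFraction);
    }
}
```

With the thread's `spark.driver.memory=4g` and the default fraction this comes out around 2.2 GiB, the same ballpark as the reported 2.6 GB MemoryStore: raising `spark.driver.memory` grows `heapBytes`, while lowering `spark.memory.fraction` shrinks the pool's share, and reducing `maxRecords` in the `take` shrinks what the driver must hold.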
[jira] [Resolved] (SPARK-31280) Perform propagating empty relation after RewritePredicateSubquery
[ https://issues.apache.org/jira/browse/SPARK-31280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31280. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28043 [https://github.com/apache/spark/pull/28043] > Perform propagating empty relation after RewritePredicateSubquery > - > > Key: SPARK-31280 > URL: https://issues.apache.org/jira/browse/SPARK-31280 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.1.0 > > > {code:java} > scala> spark.sql(" select * from values(1), (2) t(key) where key in (select 1 > as key where 1=0)").queryExecution > res15: org.apache.spark.sql.execution.QueryExecution = > == Parsed Logical Plan == > 'Project [*] > +- 'Filter 'key IN (list#39 []) >: +- Project [1 AS key#38] >: +- Filter (1 = 0) >:+- OneRowRelation >+- 'SubqueryAlias t > +- 'UnresolvedInlineTable [key], [List(1), List(2)] > == Analyzed Logical Plan == > key: int > Project [key#40] > +- Filter key#40 IN (list#39 []) >: +- Project [1 AS key#38] >: +- Filter (1 = 0) >:+- OneRowRelation >+- SubqueryAlias t > +- LocalRelation [key#40] > == Optimized Logical Plan == > Join LeftSemi, (key#40 = key#38) > :- LocalRelation [key#40] > +- LocalRelation , [key#38] > == Physical Plan == > *(1) BroadcastHashJoin [key#40], [key#38], LeftSemi, BuildRight > :- *(1) LocalTableScan [key#40] > +- Br... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
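The optimization being requested can be illustrated outside of Catalyst: once RewritePredicateSubquery turns the IN-subquery into a left-semi join, an empty build side means the whole result is empty, so the plan can collapse to an empty relation instead of running the BroadcastHashJoin shown above. A toy sketch, with plain collections standing in for relations (not Catalyst code):

```java
final class EmptyRelationPropagation {
    // Left-semi join: keep left rows whose key appears on the right.
    static List<Integer> leftSemi(List<Integer> left, Set<Integer> rightKeys) {
        if (rightKeys.isEmpty()) {
            return List.of(); // "propagate empty relation": no join needs to run
        }
        return left.stream().filter(rightKeys::contains).collect(Collectors.toList());
    }
}
```

The `isEmpty()` shortcut is the whole rule; the ticket's point is that this check only becomes applicable after the subquery rewrite has produced the join, hence the ordering change.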
[jira] [Assigned] (SPARK-31280) Perform propagating empty relation after RewritePredicateSubquery
[ https://issues.apache.org/jira/browse/SPARK-31280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31280: - Assignee: Kent Yao > Perform propagating empty relation after RewritePredicateSubquery > - > > Key: SPARK-31280 > URL: https://issues.apache.org/jira/browse/SPARK-31280 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > {code:java} > scala> spark.sql(" select * from values(1), (2) t(key) where key in (select 1 > as key where 1=0)").queryExecution > res15: org.apache.spark.sql.execution.QueryExecution = > == Parsed Logical Plan == > 'Project [*] > +- 'Filter 'key IN (list#39 []) >: +- Project [1 AS key#38] >: +- Filter (1 = 0) >:+- OneRowRelation >+- 'SubqueryAlias t > +- 'UnresolvedInlineTable [key], [List(1), List(2)] > == Analyzed Logical Plan == > key: int > Project [key#40] > +- Filter key#40 IN (list#39 []) >: +- Project [1 AS key#38] >: +- Filter (1 = 0) >:+- OneRowRelation >+- SubqueryAlias t > +- LocalRelation [key#40] > == Optimized Logical Plan == > Join LeftSemi, (key#40 = key#38) > :- LocalRelation [key#40] > +- LocalRelation , [key#38] > == Physical Plan == > *(1) BroadcastHashJoin [key#40], [key#38], LeftSemi, BuildRight > :- *(1) LocalTableScan [key#40] > +- Br... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070457#comment-17070457 ] Maxim Gekk commented on SPARK-31297: The rebasing of days doesn't depend on time zone, and has just 14 special dates: {code:scala} test("optimize rebasing") { val start = localDateToDays(LocalDate.of(1, 1, 1)) val end = localDateToDays(LocalDate.of(2030, 1, 1)) var days = start var diff = Long.MaxValue var counter = 0 while (days < end) { val rebased = rebaseGregorianToJulianDays(days) val curDiff = rebased - days if (curDiff != diff) { counter += 1 diff = curDiff val ld = daysToLocalDate(days) println(s"local date = $ld days = $days diff = ${diff} days") } days += 1 } println(s"counter = $counter") } {code} {code} local date = 0001-01-01 days = -719162 diff = -2 days local date = 0100-03-01 days = -682944 diff = -1 days local date = 0200-03-01 days = -646420 diff = 0 days local date = 0300-03-01 days = -609896 diff = 1 days local date = 0500-03-01 days = -536847 diff = 2 days local date = 0600-03-01 days = -500323 diff = 3 days local date = 0700-03-01 days = -463799 diff = 4 days local date = 0900-03-01 days = -390750 diff = 5 days local date = 1000-03-01 days = -354226 diff = 6 days local date = 1100-03-01 days = -317702 diff = 7 days local date = 1300-03-01 days = -244653 diff = 8 days local date = 1400-03-01 days = -208129 diff = 9 days local date = 1500-03-01 days = -171605 diff = 10 days local date = 1582-10-15 days = -141427 diff = 0 days counter = 14 {code} > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. 
> For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 7170 minutes > local date-time = 1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local 
date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-time =
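Since day rebasing is time-zone independent with only those 14 switch points, the lookup can be a tiny hard-coded table plus a binary search. A sketch using the days/diffs printed in the comment above (values copied from that output; days before 0001-01-01 are out of scope here):

```java
final class RebaseDaysSketch {
    // Switch days and diffs taken from the printed test output above.
    private static final int[] SWITCH_DAYS = {
        -719162, -682944, -646420, -609896, -536847, -500323, -463799,
        -390750, -354226, -317702, -244653, -208129, -171605, -141427};
    private static final int[] DIFFS = {-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0};

    // Gregorian -> Julian: add the diff of the greatest switch day <= days.
    static int rebaseGregorianToJulianDays(int days) {
        int i = Arrays.binarySearch(SWITCH_DAYS, days);
        if (i < 0) i = -i - 2; // not found: insertion point minus one
        return days + DIFFS[Math.max(i, 0)];
    }
}
```

For example, day 0 (1970-01-01) is past the last switch point, whose diff is 0, so it rebases to itself; day -719162 (0001-01-01) picks up the -2 diff.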
[jira] [Commented] (SPARK-30272) Remove usage of Guava that breaks in Guava 27
[ https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070391#comment-17070391 ] Sean R. Owen commented on SPARK-30272: -- The reason is simply that this is the version Hadoop uses through 3.2.0 and the binding to Hadoop is still relatively tight. We do shade Guava, and honestly, I have lost track myself of whether this means we could vary the Guava version in Spark or whether it causes problems in some build configurations. I think it would only work in the "Hadoop provided" build? But anyway, I've tried for the simple solution of letting Spark work with Guava 14 through 27 equally. It does limit what Spark can use Guava for a bit, but just a little. I am still kind of puzzled by the fix you found. Did you manually update the Hadoop version to 3.2.1? I'd be surprised if hadoop-azure mattered as it isn't the thing that pulls in the class in question, Guava is, and in any event, it would not refer to or include a Spark shaded version. I can't explain that one. > Remove usage of Guava that breaks in Guava 27 > - > > Key: SPARK-30272 > URL: https://issues.apache.org/jira/browse/SPARK-30272 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.0.0 > > > Background: > https://issues.apache.org/jira/browse/SPARK-29250 > https://github.com/apache/spark/pull/25932 > Hadoop 3.2.1 will update Guava from 11 to 27. There are a number of methods > that changed between those releases, typically just a rename, but it means one > set of code can't work with both, while we want to work with Hadoop 2.x and > 3.x. 
Among them: > - Objects.toStringHelper was moved to MoreObjects; we can just use the > Commons Lang3 equivalent > - Objects.hashCode etc were renamed; use java.util.Objects equivalents > - MoreExecutors.sameThreadExecutor() became directExecutor(); for same-thread > execution we can use a dummy implementation of ExecutorService / Executor > - TypeToken.isAssignableFrom became isSupertypeOf; work around with reflection > There is probably more to the Guava issue than just this change, but it will > make Spark itself work with more versions and reduce our exposure to Guava > along the way anyway. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
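For the sameThreadExecutor()/directExecutor() point, the Guava-free replacement can be as small as an ExecutorService that runs tasks on the calling thread. A JDK-only sketch (not necessarily the exact class Spark added):

```java
// Runs every task synchronously on the submitting thread, so it behaves the
// same against Guava 14 or 27 because it does not touch Guava at all.
final class SameThreadExecutorService extends AbstractExecutorService {
    private volatile boolean shutdown = false;

    @Override public void execute(Runnable command) { command.run(); }
    @Override public void shutdown() { shutdown = true; }
    @Override public List<Runnable> shutdownNow() { shutdown = true; return Collections.emptyList(); }
    @Override public boolean isShutdown() { return shutdown; }
    @Override public boolean isTerminated() { return shutdown; }
    @Override public boolean awaitTermination(long timeout, TimeUnit unit) { return shutdown; }
}
```

`AbstractExecutorService` supplies `submit`/`invokeAll` on top of `execute`, so callers that previously took Guava's same-thread executor can take this instead.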
[jira] [Updated] (SPARK-31298) validate CTAS table path in SPARK-19724 seems conflict and External table also need to check non-empty
[ https://issues.apache.org/jira/browse/SPARK-31298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31298: -- Summary: validate CTAS table path in SPARK-19724 seems conflict and External table also need to check non-empty (was: validate CTAS table path in SPARK-19724 seems conflict) > validate CTAS table path in SPARK-19724 seems conflict and External table > also need to check non-empty > -- > > Key: SPARK-31298 > URL: https://issues.apache.org/jira/browse/SPARK-31298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > In SessionCatalog.validateTableLocation() > {code:java} > val tableLocation = > new > Path(table.storage.locationUri.getOrElse(defaultTablePath(table.identifier))) > {code} > But in CreateDataSourceTableAsSelect , table location use defaultTablePath > {code:java} > assert(table.schema.isEmpty) > sparkSession.sessionState.catalog.validateTableLocation(table) > val tableLocation = if (table.tableType == CatalogTableType.MANAGED) { > Some(sessionState.catalog.defaultTablePath(table.identifier)) > } else { > table.storage.locationUri > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
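The mismatch between the two quoted snippets can be reduced to a toy model: validation honors an explicitly-set location, while the CTAS execution path ignores it for MANAGED tables, so the two can resolve different paths for the same table. A sketch of just that divergence (hypothetical helper names, not Spark's API):

```java
final class CtasLocationModel {
    // Models SessionCatalog.validateTableLocation: an explicit location wins.
    static String validatedPath(Optional<String> locationUri, String defaultPath) {
        return locationUri.orElse(defaultPath);
    }

    // Models CreateDataSourceTableAsSelect: MANAGED tables always use the default path.
    static String ctasPath(boolean managed, Optional<String> locationUri, String defaultPath) {
        return managed ? defaultPath : locationUri.orElse(defaultPath);
    }
}
```

For a managed table created with a `locationUri` set, `validatedPath` checks one directory while `ctasPath` writes to another, which is the conflict the ticket describes.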
[jira] [Updated] (SPARK-31298) validate CTAS table path in SPARK-19724 seems conflict
[ https://issues.apache.org/jira/browse/SPARK-31298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31298: -- Summary: validate CTAS table path in SPARK-19724 seems conflict (was: validate External path in SPARK-19724 seems conflict) > validate CTAS table path in SPARK-19724 seems conflict > -- > > Key: SPARK-31298 > URL: https://issues.apache.org/jira/browse/SPARK-31298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > In SessionCatalog.validateTableLocation() > {code:java} > val tableLocation = > new > Path(table.storage.locationUri.getOrElse(defaultTablePath(table.identifier))) > {code} > But in CreateDataSourceTableAsSelect , table location use defaultTablePath > {code:java} > assert(table.schema.isEmpty) > sparkSession.sessionState.catalog.validateTableLocation(table) > val tableLocation = if (table.tableType == CatalogTableType.MANAGED) { > Some(sessionState.catalog.defaultTablePath(table.identifier)) > } else { > table.storage.locationUri > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31298) validate External path in SPARK-19724 seems conflict
[ https://issues.apache.org/jira/browse/SPARK-31298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31298: -- Description: In SessionCatalog.validateTableLocation() {code:java} val tableLocation = new Path(table.storage.locationUri.getOrElse(defaultTablePath(table.identifier))) {code} But in CreateDataSourceTableAsSelect , table location use defaultTablePath {code:java} assert(table.schema.isEmpty) sparkSession.sessionState.catalog.validateTableLocation(table) val tableLocation = if (table.tableType == CatalogTableType.MANAGED) { Some(sessionState.catalog.defaultTablePath(table.identifier)) } else { table.storage.locationUri } {code} > validate External path in SPARK-19724 seems conflict > > > Key: SPARK-31298 > URL: https://issues.apache.org/jira/browse/SPARK-31298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > In SessionCatalog.validateTableLocation() > {code:java} > val tableLocation = > new > Path(table.storage.locationUri.getOrElse(defaultTablePath(table.identifier))) > {code} > But in CreateDataSourceTableAsSelect , table location use defaultTablePath > {code:java} > assert(table.schema.isEmpty) > sparkSession.sessionState.catalog.validateTableLocation(table) > val tableLocation = if (table.tableType == CatalogTableType.MANAGED) { > Some(sessionState.catalog.defaultTablePath(table.identifier)) > } else { > table.storage.locationUri > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31298) validate External path in SPARK-19724 seems conflict
angerszhu created SPARK-31298: - Summary: validate External path in SPARK-19724 seems conflict Key: SPARK-31298 URL: https://issues.apache.org/jira/browse/SPARK-31298 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070286#comment-17070286 ] Maxim Gekk commented on SPARK-31297: [~cloud_fan] [~hyukjin.kwon] [~dongjoon] WDYT? > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. > For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 
7170 minutes > local date-time = 1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-time = 1981-09-30T13:52:58 diff = 0 minutes > local date-time = 1982-09-30T12:52:58 diff = 60 minutes > local date-time = 1982-09-30T13:52:58 diff = 0 minutes > local date-time = 1983-09-30T12:52:58 diff = 60 minutes > local date-time = 1983-09-30T13:52:58 diff = 0 minutes > local date-time = 1984-09-29T15:52:58 diff = 60 minutes > local date-time = 1984-09-29T16:52:58 diff = 0 minutes > local date-time = 1985-09-28T15:52:58 diff = 60 minutes > local date-time = 1985-09-28T16:52:58 diff = 0 minutes > local date-time = 1986-09-27T15:52:58 diff = 60 minutes > local date-time = 1986-09-27T16:52:58 diff = 0 minutes > local date-time = 1987-09-26T15:52:58 diff = 60 minutes > local date-time = 
1987-09-26T16:52:58 diff = 0 minutes > local date-time = 1988-09-24T15:52:58 diff = 60 minutes > local date-time = 1988-09-24T16:52:58 diff = 0 minutes > local date-time = 1989-09-23T15:52:58 diff = 60 minutes > local date-time = 1989-09-23T16:52:58 diff = 0 minutes > local date-time = 1990-09-29T15:52:58 diff = 60 minutes > local date-time = 1990-09-29T16:52:58 diff = 0 minutes > local date-time = 1991-09-28T16:52:58 diff = 60 minutes > local date-time = 1991-09-28T17:52:58 diff = 0 minutes > local date-time = 1992-09-26T15:52:58 diff = 60 minutes > local date-time = 1992-09-26T16:52:58 diff = 0 minutes > local date-time = 1993-09-25T15:52:58 diff = 60 minutes > local date-time = 1993-09-25T16:52:58 diff = 0
[jira] [Created] (SPARK-31297) Speed-up date-time rebasing
Maxim Gekk created SPARK-31297: -- Summary: Speed-up date-time rebasing Key: SPARK-31297 URL: https://issues.apache.org/jira/browse/SPARK-31297 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk I do believe it is possible to speed up date-time rebasing by building a map of micros to diffs between original and rebased micros. And look up at the map via binary search. For example, the *America/Los_Angeles* time zone has less than 100 points when diff changes: {code:scala} test("optimize rebasing") { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += MICROS_PER_HOUR } println(s"counter = $counter") } {code} {code:java} local date-time = 0001-01-01T00:00 diff = -2909 minutes local date-time = 0100-02-28T14:00 diff = -1469 minutes local date-time = 0200-02-28T14:00 diff = -29 minutes local date-time = 0300-02-28T14:00 diff = 1410 minutes local date-time = 0500-02-28T14:00 diff = 2850 minutes local date-time = 0600-02-28T14:00 diff = 4290 minutes local date-time = 0700-02-28T14:00 diff = 5730 minutes local date-time = 0900-02-28T14:00 diff = 7170 minutes local date-time = 1000-02-28T14:00 diff = 8610 minutes local date-time = 1100-02-28T14:00 diff = 10050 minutes local date-time = 1300-02-28T14:00 diff = 11490 minutes local date-time = 1400-02-28T14:00 diff = 12930 minutes local date-time = 1500-02-28T14:00 diff = 14370 minutes local date-time = 
1582-10-14T14:00 diff = -29 minutes local date-time = 1899-12-31T16:52:58 diff = 0 minutes local date-time = 1917-12-27T11:52:58 diff = 60 minutes local date-time = 1917-12-27T12:52:58 diff = 0 minutes local date-time = 1918-09-15T12:52:58 diff = 60 minutes local date-time = 1918-09-15T13:52:58 diff = 0 minutes local date-time = 1919-06-30T16:52:58 diff = 31 minutes local date-time = 1919-06-30T17:52:58 diff = 0 minutes local date-time = 1919-08-15T12:52:58 diff = 60 minutes local date-time = 1919-08-15T13:52:58 diff = 0 minutes local date-time = 1921-08-31T10:52:58 diff = 60 minutes local date-time = 1921-08-31T11:52:58 diff = 0 minutes local date-time = 1921-09-30T11:52:58 diff = 60 minutes local date-time = 1921-09-30T12:52:58 diff = 0 minutes local date-time = 1922-09-30T12:52:58 diff = 60 minutes local date-time = 1922-09-30T13:52:58 diff = 0 minutes local date-time = 1981-09-30T12:52:58 diff = 60 minutes local date-time = 1981-09-30T13:52:58 diff = 0 minutes local date-time = 1982-09-30T12:52:58 diff = 60 minutes local date-time = 1982-09-30T13:52:58 diff = 0 minutes local date-time = 1983-09-30T12:52:58 diff = 60 minutes local date-time = 1983-09-30T13:52:58 diff = 0 minutes local date-time = 1984-09-29T15:52:58 diff = 60 minutes local date-time = 1984-09-29T16:52:58 diff = 0 minutes local date-time = 1985-09-28T15:52:58 diff = 60 minutes local date-time = 1985-09-28T16:52:58 diff = 0 minutes local date-time = 1986-09-27T15:52:58 diff = 60 minutes local date-time = 1986-09-27T16:52:58 diff = 0 minutes local date-time = 1987-09-26T15:52:58 diff = 60 minutes local date-time = 1987-09-26T16:52:58 diff = 0 minutes local date-time = 1988-09-24T15:52:58 diff = 60 minutes local date-time = 1988-09-24T16:52:58 diff = 0 minutes local date-time = 1989-09-23T15:52:58 diff = 60 minutes local date-time = 1989-09-23T16:52:58 diff = 0 minutes local date-time = 1990-09-29T15:52:58 diff = 60 minutes local date-time = 1990-09-29T16:52:58 diff = 0 minutes local date-time = 
1991-09-28T16:52:58 diff = 60 minutes local date-time = 1991-09-28T17:52:58 diff = 0 minutes local date-time = 1992-09-26T15:52:58 diff = 60 minutes local date-time = 1992-09-26T16:52:58 diff = 0 minutes local date-time = 1993-09-25T15:52:58 diff = 60 minutes local date-time = 1993-09-25T16:52:58 diff = 0 minutes local date-time = 1994-09-24T15:52:58 diff = 60 minutes local date-time = 1994-09-24T16:52:58 diff = 0 minutes local date-time = 1995-09-23T15:52:58 diff = 60 minutes local date-time = 1995-09-23T16:52:58 diff = 0 minutes local date-time = 1996-10-26T15:52:58 diff = 60 minutes local date-time = 1996-10-26T16:52:58 diff = 0 minutes local date-time = 1997-10-25T15:52:58 diff = 60 minutes local date-time = 1997-10-25T16:52:58 diff =
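The proposal itself — precompute the points where the diff changes, then answer each rebase with a binary search over them — can be sketched generically. Here the "slow" rebase function is a parameter, and the toy one in the usage note is made up; the real one would be rebaseGregorianToJulianMicros, scanned at a step fine enough to catch every transition for the target time zone:

```java
final class RebaseMapSketch {
    private final long[] switches;
    private final long[] diffs;

    // Scan [start, end) with the given step and record every point where
    // slowRebase(t) - t changes value.
    RebaseMapSketch(long start, long end, long step, LongUnaryOperator slowRebase) {
        ArrayList<long[]> points = new ArrayList<>();
        long prevDiff = Long.MIN_VALUE;
        for (long t = start; t < end; t += step) {
            long diff = slowRebase.applyAsLong(t) - t;
            if (diff != prevDiff) {
                points.add(new long[] {t, diff});
                prevDiff = diff;
            }
        }
        switches = new long[points.size()];
        diffs = new long[points.size()];
        for (int i = 0; i < points.size(); i++) {
            switches[i] = points.get(i)[0];
            diffs[i] = points.get(i)[1];
        }
    }

    // Fast path: binary-search the greatest switch point <= t and add its diff.
    long rebase(long t) {
        int i = Arrays.binarySearch(switches, t);
        if (i < 0) i = -i - 2; // not found: insertion point minus one
        return t + diffs[Math.max(i, 0)];
    }
}
```

For example, scanning the step function `t -> t + (t < 100 ? -2 : 5)` over [0, 200) at step 1 yields just two switch points, after which every `rebase` call is a single binary search instead of a calendar computation.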
[jira] [Commented] (SPARK-30272) Remove usage of Guava that breaks in Guava 27
[ https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070255#comment-17070255 ] Jorge Machado commented on SPARK-30272: --- So I was able to fix it. I built it with the hadoop-3.2 profile, but after the build the hadoop-azure.jar was missing, so I added it manually into my container and now it seems to load. I was trying to put Guava 28 in and remove 14, but this is a lot of work... why do we use an old Guava version? > Remove usage of Guava that breaks in Guava 27 > - > > Key: SPARK-30272 > URL: https://issues.apache.org/jira/browse/SPARK-30272 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.0.0 > > > Background: > https://issues.apache.org/jira/browse/SPARK-29250 > https://github.com/apache/spark/pull/25932 > Hadoop 3.2.1 will update Guava from 11 to 27. There are a number of methods > that changed between those releases, typically just a rename, but it means one > set of code can't work with both, while we want to work with Hadoop 2.x and > 3.x. Among them: > - Objects.toStringHelper was moved to MoreObjects; we can just use the > Commons Lang3 equivalent > - Objects.hashCode etc were renamed; use java.util.Objects equivalents > - MoreExecutors.sameThreadExecutor() became directExecutor(); for same-thread > execution we can use a dummy implementation of ExecutorService / Executor > - TypeToken.isAssignableFrom became isSupertypeOf; work around with reflection > There is probably more to the Guava issue than just this change, but it will > make Spark itself work with more versions and reduce our exposure to Guava > along the way anyway. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31296) Benchmark date-time rebasing in Parquet datasource
[ https://issues.apache.org/jira/browse/SPARK-31296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31296: --- Summary: Benchmark date-time rebasing in Parquet datasource (was: Benchmark date-time rebasing to/from Julian calendar) > Benchmark date-time rebasing in Parquet datasource > -- > > Key: SPARK-31296 > URL: https://issues.apache.org/jira/browse/SPARK-31296 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > * Add benchmarks for saving dates/timestamps to parquet when > spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true > * Add benchmark for loading dates/timestamps from parquet when rebasing is on -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31296) Benchmark date-time rebasing to/from Julian calendar
Maxim Gekk created SPARK-31296: -- Summary: Benchmark date-time rebasing to/from Julian calendar Key: SPARK-31296 URL: https://issues.apache.org/jira/browse/SPARK-31296 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk * Add benchmarks for saving dates/timestamps to parquet when spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true * Add benchmark for loading dates/timestamps from parquet when rebasing is on -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30363) Add Documentation for Refresh Resources
[ https://issues.apache.org/jira/browse/SPARK-30363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-30363: --- Parent: SPARK-28588 Issue Type: Sub-task (was: Improvement) > Add Documentation for Refresh Resources > --- > > Key: SPARK-30363 > URL: https://issues.apache.org/jira/browse/SPARK-30363 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Minor > Fix For: 3.0.0 > > > Refresh Resources is not documented in the docs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org