[jira] [Created] (SPARK-31186) toPandas fails on simple query (collect() works)
Michael Chirico created SPARK-31186:
------------------------------------

Summary: toPandas fails on simple query (collect() works)
Key: SPARK-31186
URL: https://issues.apache.org/jira/browse/SPARK-31186
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.4
Reporter: Michael Chirico

My pandas version is 0.25.1. I ran the following simple code (cross joins are enabled):

{code:python}
spark.sql('''
select t1.*, t2.*
from ( select explode(sequence(1, 3)) v ) t1
left join ( select explode(sequence(1, 3)) v ) t2
''').toPandas()
{code}

and got a ValueError from pandas:

> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(),
> a.item(), a.any() or a.all().

collect() works fine:

{code:python}
spark.sql('''
select *
from ( select explode(sequence(1, 3)) v ) t1
left join ( select explode(sequence(1, 3)) v ) t2
''').collect()
# [Row(v=1, v=1),
#  Row(v=1, v=2),
#  Row(v=1, v=3),
#  Row(v=2, v=1),
#  Row(v=2, v=2),
#  Row(v=2, v=3),
#  Row(v=3, v=1),
#  Row(v=3, v=2),
#  Row(v=3, v=3)]
{code}

I imagine it's related to the duplicate column names, but this doesn't fail:

{code:python}
spark.sql("select 1 v, 1 v").toPandas()
#    v  v
# 0  1  1
{code}

There is also no issue for multiple rows:

{code:python}
spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()
{code}

It also works when replacing the cross join with a janky programmatically-generated union all query:

{code:python}
cond = []
for ii in range(3):
    for jj in range(3):
        cond.append(f'select {ii+1} v, {jj+1} v')
spark.sql(' union all '.join(cond)).toPandas()
{code}

As near as I can tell, the output is identical to the explode output, which makes the issue all the more peculiar: I thought toPandas() is applied to the output of collect(), so if collect() gives the same output, how can toPandas() fail in one case and not the other? Further, the lazy DataFrame is the same in both cases: DataFrame[v: int, v: int]. I must be missing something.
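The pandas error above can be reproduced without Spark. Under duplicate column labels, selecting a column returns a DataFrame rather than a Series, and truth-testing a dtype comparison then raises exactly this ValueError. This is a plausible trigger inside toPandas()'s per-column type handling; the exact internal code path is an assumption here:

```python
import pandas as pd

# A frame with duplicate column labels, like the cross-join result above.
df = pd.DataFrame([[1, 2]], columns=["v", "v"])

# With duplicate labels, name-based selection yields a DataFrame, not a Series.
print(type(df["v"]))  # <class 'pandas.core.frame.DataFrame'>

# Any code that compares dtypes and truth-tests the result now holds a
# multi-element Series, and bool(Series) raises the reported error.
try:
    bool(df.dtypes == "int64")
except ValueError as e:
    print(e)  # The truth value of a Series is ambiguous. ...
```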
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31159) Incompatible Parquet dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-31159:
--------------------------------
Parent: SPARK-30951
Issue Type: Sub-task (was: Bug)

> Incompatible Parquet dates/timestamps with Spark 2.4
>
> Key: SPARK-31159
> URL: https://issues.apache.org/jira/browse/SPARK-31159
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Maxim Gekk
> Priority: Major
>
> Write dates/timestamps to Parquet file in Spark 2.4:
> {code}
> $ export TZ="UTC"
> $ ~/spark-2.4/bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>       /_/
>
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
>
> scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
> df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
>
> scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
>
> scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
>
> scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
>
> scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
> +----------+--------------------------+
> |d         |ts                        |
> +----------+--------------------------+
> |1001-01-01|1001-01-01 01:02:03.123456|
> +----------+--------------------------+
> {code}
> Spark 2.4 saves dates/timestamps in the Julian calendar. The parquet-mr tool
> prints *1001-01-07* and *1001-01-07T01:02:03.123456+*:
> {code}
> $ java -jar /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar dump -m ./2_4_5_micros/part-0-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet
>
> INT32 d
>
> *** row group 1 of 1, values 1 to 1 ***
> value 1: R:0 D:1 V:1001-01-07
>
> INT64 ts
>
> *** row group 1 of 1, values 1 to 1 ***
> value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+
> {code}
> Spark 3.0.0-preview2 (and 3.1.0-SNAPSHOT) prints the same as parquet-mr but
> different values from Spark 2.4:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
>       /_/
>
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
>
> scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
> +----------+--------------------------+
> |d         |ts                        |
> +----------+--------------------------+
> |1001-01-07|1001-01-07 01:02:03.123456|
> +----------+--------------------------+
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
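The 6-day gap between the calendars for year 1001 can be checked independently with the standard Julian Day Number formulas (a sketch for illustration, not Spark code):

```python
# Convert a Julian-calendar date and a proleptic-Gregorian date to a Julian
# Day Number (JDN) using the standard integer formulas, then compare.

def julian_to_jdn(y, m, d):
    a = (14 - m) // 12
    y2, m2 = y + 4800 - a, m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - 32083

def gregorian_to_jdn(y, m, d):
    a = (14 - m) // 12
    y2, m2 = y + 4800 - a, m + 12 * a - 3
    return (d + (153 * m2 + 2) // 5 + 365 * y2
            + y2 // 4 - y2 // 100 + y2 // 400 - 32045)

# The same nominal date denotes instants 6 days apart in the two calendars,
# so the instant written as Julian 1001-01-01 re-renders as proleptic
# Gregorian 1001-01-07.
print(julian_to_jdn(1001, 1, 1) - gregorian_to_jdn(1001, 1, 1))  # 6
```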
[jira] [Updated] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-31183:
--------------------------------
Parent: SPARK-30951
Issue Type: Sub-task (was: Bug)

> Incompatible Avro dates/timestamps with Spark 2.4
>
> Key: SPARK-31183
> URL: https://issues.apache.org/jira/browse/SPARK-31183
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Maxim Gekk
> Priority: Major
>
> Write dates/timestamps to Avro file in Spark 2.4.5:
> {code}
> $ export TZ="America/Los_Angeles"
> $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5
> {code}
> {code:scala}
> scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")
>
> scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-01|
> +----------+
>
> scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")
>
> scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +--------------------------+
> |ts                        |
> +--------------------------+
> |1001-01-01 01:02:03.123456|
> +--------------------------+
> {code}
> Spark 3.0.0-preview2 (and 3.1.0-SNAPSHOT) outputs different values from Spark 2.4.5:
> {code}
> $ export TZ="America/Los_Angeles"
> $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5
> {code}
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>
> scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-07|
> +----------+
>
> scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +--------------------------+
> |ts                        |
> +--------------------------+
> |1001-01-07 01:09:05.123456|
> +--------------------------+
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
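Beyond the 6-day calendar shift, the timestamp above also moves by 7 minutes 2 seconds (01:02:03 → 01:09:05). A plausible source is the local-mean-time (LMT) offset the tz database uses for instants before standard time was adopted: for America/Los_Angeles that is UTC-07:52:58, i.e. 7:02 away from UTC-08:00. A sketch, assuming Python 3.9+ with `zoneinfo` and system tzdata available:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# For dates long before standard time was adopted, the tz database reports
# local mean time (LMT); for America/Los_Angeles that is UTC-07:52:58.
dt = datetime(1001, 1, 1, tzinfo=ZoneInfo("America/Los_Angeles"))
print(dt.utcoffset())                        # -1 day, 16:07:02 (= UTC-07:52:58)

# That is 7 minutes 2 seconds away from a fixed UTC-08:00 offset, matching
# the 01:02:03 -> 01:09:05 discrepancy above.
print(dt.utcoffset() - timedelta(hours=-8))  # 0:07:02
```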
[jira] [Updated] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
[ https://issues.apache.org/jira/browse/SPARK-31076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-31076:
--------------------------------
Parent: SPARK-30951
Issue Type: Sub-task (was: Improvement)

> Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
>
> Key: SPARK-31076
> URL: https://issues.apache.org/jira/browse/SPARK-31076
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Labels: correctness
> Fix For: 3.0.0
>
> By default, collect() returns java.sql.Timestamp/Date instances with offsets derived from the internal values of Catalyst's TIMESTAMP/DATE, which store microseconds since the epoch. The conversion from internal values to java.sql.Timestamp/Date is based on the Proleptic Gregorian calendar, but converting the resulting values before the year 1582 to strings produces timestamp/date strings in the Julian calendar. For example:
> {code}
> scala> sql("select date '1100-10-10'").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
> {code}
> This can be fixed by converting the internal Catalyst values to a local date-time in the Gregorian calendar, and then constructing the local date-time from the resulting year, month, ..., seconds in the Julian calendar.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
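The proposed fix can be illustrated with the DATE '1100-10-10' example: shifting the raw day count between calendars moves the printed date, while carrying over the (year, month, day) fields keeps it stable. A toy model using standard Julian Day Number formulas, not Spark's actual code:

```python
def gregorian_to_jdn(y, m, d):
    # Proleptic Gregorian date -> Julian Day Number (standard integer formula).
    a = (14 - m) // 12
    y2, m2 = y + 4800 - a, m + 12 * a - 3
    return (d + (153 * m2 + 2) // 5 + 365 * y2
            + y2 // 4 - y2 // 100 + y2 // 400 - 32045)

def jdn_to_julian(jdn):
    # Julian Day Number -> (year, month, day) in the Julian calendar.
    c = jdn + 32082
    d = (4 * c + 3) // 1461
    e = c - 1461 * d // 4
    m = (5 * e + 2) // 153
    return (d - 4800 + m // 10, m + 3 - 12 * (m // 10),
            e - (153 * m + 2) // 5 + 1)

# Instant-preserving conversion (old behavior): keep the absolute day number
# and re-render it in the Julian calendar -> the printed date shifts.
print(jdn_to_julian(gregorian_to_jdn(1100, 10, 10)))  # (1100, 10, 3)

# Field-preserving conversion (the proposed fix): rebuild the external value
# from the extracted (1100, 10, 10) fields, so it prints unchanged.
```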
[jira] [Commented] (SPARK-28626) Spark leaves unencrypted data on local disk, even with encryption turned on (CVE-2019-10099)
[ https://issues.apache.org/jira/browse/SPARK-28626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062230#comment-17062230 ]

Wing Yew Poon commented on SPARK-28626:
---------------------------------------

For the record, to assist folks who need to backport this:

From branch-2.3, we also need [https://github.com/apache/spark/commit/323dc3ad02e63a7c99b5bd6da618d6020657ecba] [PYSPARK] Update py4j to version 0.10.7.

For the SPARKR change, there is a preceding change that is needed: [https://github.com/apache/spark/commit/dad5c48b2a229bf6f9e6b8548f9335f04a15c818] [MINOR][PYTHON] Use a helper in `PythonUtils` instead of directly accessing the Scala package.

> Spark leaves unencrypted data on local disk, even with encryption turned on (CVE-2019-10099)
>
> Key: SPARK-28626
> URL: https://issues.apache.org/jira/browse/SPARK-28626
> Project: Spark
> Issue Type: Bug
> Components: Security
> Affects Versions: 2.3.2
> Reporter: Imran Rashid
> Priority: Major
> Fix For: 2.3.3, 2.4.0
>
> Severity: Important
>
> Vendor: The Apache Software Foundation
>
> Versions affected:
> All Spark 1.x, Spark 2.0.x, Spark 2.1.x, and 2.2.x versions
> Spark 2.3.0 to 2.3.2
>
> Description:
> Prior to Spark 2.3.3, in certain situations Spark would write user data to local disk unencrypted, even if spark.io.encryption.enabled=true. This includes cached blocks that are fetched to disk (controlled by spark.maxRemoteBlockSizeFetchToMem); in SparkR, using parallelize; in PySpark, using broadcast and parallelize; and the use of Python UDFs.
>
> Mitigation:
> 1.x, 2.0.x, 2.1.x, 2.2.x, and 2.3.x users should upgrade to 2.3.3 or newer, including 2.4.x.
>
> Credit:
> This issue was reported by Thomas Graves of NVIDIA.
>
> References:
> [https://spark.apache.org/security.html]
>
> The following commits were used to fix this issue in branch-2.3 (there may be other, equivalent commits in master / branch-2.4):
> {noformat}
> commit 575fea120e25249716e3f680396580c5f9e26b5b
> Author: Imran Rashid
> Date: Wed Aug 22 16:38:28 2018 -0500
>
> [CORE] Updates to remote cache reads
>
> Covered by tests in DistributedSuite
>
> commit 6d742d1bd71aa3803dce91a830b37284cb18cf70
> Author: Imran Rashid
> Date: Thu Sep 6 12:11:47 2018 -0500
>
> [PYSPARK][SQL] Updates to RowQueue
>
> Tested with updates to RowQueueSuite
>
> commit 09dd34cb1706f2477a89174d6a1a0f17ed5b0a65
> Author: Imran Rashid
> Date: Mon Aug 13 21:35:34 2018 -0500
>
> [PYSPARK] Updates to pyspark broadcast
>
> commit 12717ba0edfa5459c9ac2085f46b1ecc0ee759aa
> Author: hyukjinkwon
> Date: Mon Sep 24 19:25:02 2018 +0800
>
> [SPARKR] Match pyspark features in SparkR communication protocol
> {noformat}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27941) Serverless Spark in the Cloud
[ https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062198#comment-17062198 ]

Dongjoon Hyun commented on SPARK-27941:
---------------------------------------

Hi, [~mu5358271]. Is there any update on this issue?

> Serverless Spark in the Cloud
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
> Issue Type: New Feature
> Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
> Affects Versions: 3.1.0
> Reporter: Shuheng Dai
> Priority: Major
>
> Public cloud providers have started offering serverless container services. For example, AWS offers Fargate [https://aws.amazon.com/fargate/]. This opens up the possibility of running Spark workloads in a serverless manner and removing the need to provision, maintain, and manage a cluster. POC: [https://github.com/mu5358271/spark-on-fargate]
>
> While it might not make sense for Spark to favor any particular cloud provider or to support a large number of cloud providers natively, it would make sense to make some of the internal Spark components more pluggable and cloud friendly, so that it is easier for various cloud providers to integrate. For example:
> * authentication: IO and network encryption requires authentication via securely sharing a secret, and the implementation of this is currently tied to the cluster manager: YARN uses Hadoop UGI; Kubernetes uses a shared file mounted on all pods. These can be decoupled so that it is possible to swap in an implementation using a public cloud. In the POC, this is implemented by passing around an AWS KMS-encrypted secret and decrypting the secret at each executor, which delegates authentication and authorization to the cloud.
> * deployment & scheduler: adding a new cluster manager and scheduler backend requires changing a number of places in the Spark core package and rebuilding the entire project. Having a pluggable scheduler per https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add different scheduler backends backed by different cloud providers.
> * client-cluster communication: I am not very familiar with the network part of the code base, so I might be wrong on this. My understanding is that the code base assumes that the client and the cluster are on the same network and that the nodes communicate with each other via hostname/IP. For security best practice, it is advised to run the executors in a private protected network, which may be separate from the client machine's network. Since we are serverless, the client needs to first launch the driver into the private network, and the driver in turn starts the executors, potentially doubling job initialization time. This can be solved by dropping complete serverlessness and having a persistent host in the private network, or (I do not have a POC, so I am not sure if this actually works) by implementing client-cluster communication via message queues in the cloud to get around the network separation.
> * shuffle storage and retrieval: external shuffle in YARN relies on the existence of a persistent cluster that continues to serve shuffle files beyond the lifecycle of the executors. This assumption no longer holds in a serverless cluster with only transient containers. Pluggable remote shuffle storage per https://issues.apache.org/jira/browse/SPARK-25299 would make it easier to introduce new cloud-backed shuffle.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts
[ https://issues.apache.org/jira/browse/SPARK-31178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Burak Yavuz resolved SPARK-31178.
---------------------------------
Fix Version/s: 3.0.0
Assignee: Burak Yavuz
Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/27941]

> sql("INSERT INTO v2DataSource ...").collect() double inserts
>
> Key: SPARK-31178
> URL: https://issues.apache.org/jira/browse/SPARK-31178
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Burak Yavuz
> Assignee: Burak Yavuz
> Priority: Blocker
> Fix For: 3.0.0
>
> The following unit test fails in DataSourceV2SQLSuite:
> {code:java}
> test("do not double insert on INSERT INTO collect()") {
>   import testImplicits._
>   val t1 = s"${catalogAndNamespace}tbl"
>   sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
>   val tmpView = "test_data"
>   val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data")
>   df.createOrReplaceTempView(tmpView)
>   sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect()
>   verifyTable(t1, df)
> }
> {code}
> The INSERT INTO double-inserts when ".collect()" is called. I think this is because the V2 SparkPlans are not commands, and doExecute on a Spark plan can be called multiple times, causing data to be inserted multiple times.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
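The suspected mechanism can be modeled in a few lines (a toy sketch; class names are illustrative, not Spark's): a side-effecting plan whose execute method runs on every call double-inserts, while a command-style wrapper that memoizes its result runs the side effect exactly once.

```python
class InsertPlan:
    """Toy stand-in for a V2 write plan whose execute() has a side effect."""
    def __init__(self, table):
        self.table = table
    def execute(self, rows):
        self.table.extend(rows)  # side effect repeats on every call
        return []

class InsertCommand(InsertPlan):
    """Command-style wrapper: run once, cache the result."""
    def __init__(self, table):
        super().__init__(table)
        self._result = None
    def execute(self, rows):
        if self._result is None:
            self._result = super().execute(rows)
        return self._result

t1, t2 = [], []
plan, cmd = InsertPlan(t1), InsertCommand(t2)
for _ in range(2):          # e.g. one invocation from the write, one from collect()
    plan.execute([1, 2, 3])
    cmd.execute([1, 2, 3])
print(len(t1), len(t2))     # 6 3 -- the plain plan inserted twice
```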
[jira] [Created] (SPARK-31185) implement VarianceThresholdSelector
Huaxin Gao created SPARK-31185:
-------------------------------

Summary: Implement VarianceThresholdSelector
Key: SPARK-31185
URL: https://issues.apache.org/jira/browse/SPARK-31185
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 3.1.0
Reporter: Huaxin Gao

Implement a feature selector that removes all low-variance features. Features with a variance lower than the threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
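A minimal NumPy sketch of the described semantics (the exact threshold boundary in the eventual Spark API is an assumption here: this version keeps features with variance strictly above the threshold, so the 0.0 default drops constant columns):

```python
import numpy as np

def variance_threshold_select(X, threshold=0.0):
    # Keep columns whose (population) variance exceeds the threshold;
    # the 0.0 default removes features constant across all samples.
    X = np.asarray(X, dtype=float)
    keep = X.var(axis=0) > threshold
    return X[:, keep], np.flatnonzero(keep)

X = [[1.0, 0.0, 3.0],
     [1.0, 4.0, 3.5],
     [1.0, 2.0, 3.2]]
selected, idx = variance_threshold_select(X)
print(idx)  # [1 2] -- column 0 has zero variance and is removed
```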
[jira] [Resolved] (SPARK-30933) ML, GraphX 3.0 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-30933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-30933.
----------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed

Issue resolved by pull request 27880 [https://github.com/apache/spark/pull/27880]

> ML, GraphX 3.0 QA: Update user guide for new features & APIs
>
> Key: SPARK-30933
> URL: https://issues.apache.org/jira/browse/SPARK-30933
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, GraphX, ML, MLlib
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Major
> Fix For: 3.0.0
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & link JIRAs for parts of this work.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30933) ML, GraphX 3.0 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-30933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen reassigned SPARK-30933:
------------------------------------
Assignee: Huaxin Gao (was: zhengruifeng)

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30933) ML, GraphX 3.0 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-30933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen reassigned SPARK-30933:
------------------------------------
Assignee: zhengruifeng

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31184) Support getTablesByType API of Hive Client
Xin Wu created SPARK-31184:
---------------------------

Summary: Support getTablesByType API of Hive Client
Key: SPARK-31184
URL: https://issues.apache.org/jira/browse/SPARK-31184
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu

Hive 2.3+ supports the getTablesByType API, which is a precondition for implementing SHOW VIEWS in HiveExternalCatalog. Currently, without this API, we cannot directly get Hive tables with type HiveTableType.VIRTUAL_VIEW.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
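The precondition can be sketched as a capability check with a fallback: if the metastore client exposes a get-tables-by-type call, use it; otherwise list all tables and filter each one's type with an extra round-trip per table. All names below are illustrative stand-ins, not Spark's or Hive's actual API:

```python
class FakeMetastoreClient:
    """Illustrative stand-in for a pre-2.3 Hive metastore client
    (no bulk get-tables-by-type call)."""
    def __init__(self, tables):
        self._tables = tables  # name -> table type

    def get_tables(self, db, pattern):
        return sorted(self._tables)

    def get_table_type(self, db, name):
        return self._tables[name]

def list_views(client, db, pattern="*"):
    # Fast path: a Hive-2.3-style bulk call, if the client provides one.
    if hasattr(client, "get_tables_by_type"):
        return client.get_tables_by_type(db, pattern, "VIRTUAL_VIEW")
    # Fallback: one extra metastore lookup per table.
    return [t for t in client.get_tables(db, pattern)
            if client.get_table_type(db, t) == "VIRTUAL_VIEW"]

client = FakeMetastoreClient({"t1": "MANAGED_TABLE", "v1": "VIRTUAL_VIEW"})
print(list_views(client, "default"))  # ['v1']
```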
[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061916#comment-17061916 ]

Reynold Xin commented on SPARK-25728:
-------------------------------------

It's too big a change, and realistically there are probably only a few people in the world who can do this well. I'm going to close it.

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Kazuaki Ishizaki
> Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate representation for generating Java code from a program using the DataFrame or Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196]. Please feel free to comment on this JIRA entry or the [Google Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing], too.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
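The core idea, a structured tree instead of concatenated Java strings, can be illustrated in a few lines (a toy example, unrelated to Spark's actual codegen): a tree admits passes like constant folding that are easy on nodes and painful on raw strings.

```python
from dataclasses import dataclass

@dataclass
class Lit:
    value: int

@dataclass
class Add:
    left: object
    right: object

def emit_java(node):
    # Pretty-print the tree as a Java expression string (the final step,
    # rather than the representation itself).
    if isinstance(node, Lit):
        return str(node.value)
    return f"({emit_java(node.left)} + {emit_java(node.right)})"

def fold(node):
    # Constant folding: trivial on a tree, hard on concatenated strings.
    if isinstance(node, Add):
        l, r = fold(node.left), fold(node.right)
        if isinstance(l, Lit) and isinstance(r, Lit):
            return Lit(l.value + r.value)
        return Add(l, r)
    return node

expr = Add(Lit(1), Add(Lit(2), Lit(3)))
print(emit_java(expr))        # (1 + (2 + 3))
print(emit_java(fold(expr)))  # 6
```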
[jira] [Resolved] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-25728.
---------------------------------
Resolution: Won't Fix

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061914#comment-17061914 ]

Kazuaki Ishizaki commented on SPARK-25728:
------------------------------------------

For now, no update on my side. I am happy if the community or PMCs want to revitalize it. Otherwise, should I close this?

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061902#comment-17061902 ]

Maxim Gekk commented on SPARK-31183:
------------------------------------

I am working on the issue.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061903#comment-17061903 ]

Maxim Gekk commented on SPARK-31183:
------------------------------------

[~cloud_fan] FYI

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4
Maxim Gekk created SPARK-31183:
-------------------------------

Summary: Incompatible Avro dates/timestamps with Spark 2.4
Key: SPARK-31183
URL: https://issues.apache.org/jira/browse/SPARK-31183
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk

Write dates/timestamps to Avro file in Spark 2.4.5:

{code}
$ export TZ="America/Los_Angeles"
$ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5
{code}
{code:scala}
scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")

scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-01|
+----------+

scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")

scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-01 01:02:03.123456|
+--------------------------+
{code}

Spark 3.0.0-preview2 (and 3.1.0-SNAPSHOT) outputs different values from Spark 2.4.5:

{code}
$ export TZ="America/Los_Angeles"
$ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5
{code}
{code:scala}
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
+----------+

scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-07 01:09:05.123456|
+--------------------------+
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31175) Avoid creating reverse comparator for each compare in InterpretedOrdering
[ https://issues.apache.org/jira/browse/SPARK-31175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31175: --- Assignee: wuyi > Avoid creating reverse comparator for each compare in InterpretedOrdering > - > > Key: SPARK-31175 > URL: https://issues.apache.org/jira/browse/SPARK-31175 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: wuyi > Assignee: wuyi > Priority: Major > > Currently, we create a new reverse comparator for each compare in InterpretedOrdering, which can generate many small, short-lived objects and put pressure on the JVM garbage collector.
[jira] [Resolved] (SPARK-31175) Avoid creating reverse comparator for each compare in InterpretedOrdering
[ https://issues.apache.org/jira/browse/SPARK-31175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31175. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27938 [https://github.com/apache/spark/pull/27938] > Avoid creating reverse comparator for each compare in InterpretedOrdering > - > > Key: SPARK-31175 > URL: https://issues.apache.org/jira/browse/SPARK-31175 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: wuyi > Assignee: wuyi > Priority: Major > Fix For: 3.1.0 > > > Currently, we create a new reverse comparator for each compare in InterpretedOrdering, which can generate many small, short-lived objects and put pressure on the JVM garbage collector.
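The fix idea can be sketched outside of Spark: build the reverse comparator once and reuse it, instead of allocating a fresh one inside every compare call. The class names below are illustrative, not Spark's actual InterpretedOrdering code:

```python
class Ascending:
    """Base comparator: returns negative, zero, or positive, like compareTo."""
    def compare(self, a, b):
        return (a > b) - (a < b)

class Descending:
    """Reverse comparator created once and cached, rather than re-created
    on every compare call (which would churn out short-lived objects)."""
    def __init__(self, base):
        self._base = base  # single allocation, reused for all compares

    def compare(self, a, b):
        return -self._base.compare(a, b)
```

A sort loop can then call `desc.compare(x, y)` millions of times without any per-comparison allocation.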
[jira] [Resolved] (SPARK-30295) Remove Hive dependencies from SparkSQLCLI
[ https://issues.apache.org/jira/browse/SPARK-30295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30295. -- Resolution: Won't Fix > Remove Hive dependencies from SparkSQLCLI > - > > Key: SPARK-30295 > URL: https://issues.apache.org/jira/browse/SPARK-30295 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: Javier Fuentes > Priority: Major > > Remove unnecessary Hive dependencies from the Spark SQL CLI, replacing them with a native Scala implementation of the client driver, argument parser, and SparkSqlCliDriver.
[jira] [Updated] (SPARK-31165) Multiple wrong references in Dockerfile for k8s
[ https://issues.apache.org/jira/browse/SPARK-31165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Dimolarov updated SPARK-31165: -- Description: I am currently trying to follow the k8s instructions for Spark: [https://spark.apache.org/docs/latest/running-on-kubernetes.html]. When I cloned apache/spark from GitHub on the master branch, I found multiple wrong folder references while trying to build my Docker image: *Issue 1: The comments in the Dockerfile reference the wrong folder for the Dockerfile:* {code:java} # If this docker file is being used in the context of building your images from a Spark # distribution, the docker build command should be invoked from the top level directory # of the Spark distribution. E.g.: # docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .{code} That docker build command simply won't run. I only got the following to run: {code:java} docker build -t spark:latest -f resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile . {code} which is the actual path to the Dockerfile. *Issue 2: jars folder does not exist* After reading the tutorial, I of course built Spark first, as per the instructions, with: {code:java} ./build/mvn -Pkubernetes -DskipTests clean package{code} Nonetheless, I get this error when building the image: {code:java} Step 5/18 : COPY jars /opt/spark/jars COPY failed: stat /var/lib/docker/tmp/docker-builder402673637/jars: no such file or directory{code} for which I may have found a similar issue here: [https://stackoverflow.com/questions/52451538/spark-for-kubernetes-test-on-mac] I am new to Spark, but I assume this jars folder - if the build step actually created it; I did run the Maven build of the master branch successfully with the command above - would exist in the root folder of the project. 
Turns out it's here: spark/assembly/target/scala-2.12/jars *Issue 3: missing entrypoint.sh and decom.sh due to wrong reference* While Issue 2 remains unresolved, as I can't wrap my head around the missing jars folder (bin and sbin got copied successfully after I made a dummy jars folder), I then got stuck on these two steps: {code:java} COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/ COPY kubernetes/dockerfiles/spark/decom.sh /opt/{code} with: {code:java} Step 8/18 : COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/ COPY failed: stat /var/lib/docker/tmp/docker-builder638219776/kubernetes/dockerfiles/spark/entrypoint.sh: no such file or directory{code} which makes sense, since the paths should actually be: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh resource-managers/kubernetes/docker/src/main/dockerfiles/spark/decom.sh *Remark* I only created one issue, since this seems like somebody cleaned up the repo and forgot to change these references. Am I missing something here? If I am, I apologise in advance, since I am new to the Spark project. I also saw that some of these references were handled through vars in previous branches: [https://github.com/apache/spark/blob/branch-2.4/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile] (e.g. 2.4), but that also does not run out of the box. I am also really not sure about the affected versions, since that was not transparent enough for me on GitHub - feel free to edit that field :) Thanks in advance! 
[jira] [Assigned] (SPARK-31176) Remove support for 'e'/'c' as datetime pattern character
[ https://issues.apache.org/jira/browse/SPARK-31176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31176: --- Assignee: Kent Yao > Remove support for 'e'/'c' as datetime pattern character > - > > Key: SPARK-31176 > URL: https://issues.apache.org/jira/browse/SPARK-31176 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: Kent Yao > Assignee: Kent Yao > Priority: Major > > In SimpleDateFormat, 'u' meant day number of week, but in DateTimeFormatter its meaning was changed to year. So we keep the old meaning of 'u' by substituting it with 'e' internally and using DateTimeFormatter to parse the pattern string. Since 'e' and 'c' also represent day-of-week in DateTimeFormatter, we should mark them as illegal pattern characters to stay consistent with the old behavior.
[jira] [Resolved] (SPARK-31176) Remove support for 'e'/'c' as datetime pattern character
[ https://issues.apache.org/jira/browse/SPARK-31176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31176. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27939 [https://github.com/apache/spark/pull/27939] > Remove support for 'e'/'c' as datetime pattern character > - > > Key: SPARK-31176 > URL: https://issues.apache.org/jira/browse/SPARK-31176 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: Kent Yao > Assignee: Kent Yao > Priority: Major > Fix For: 3.0.0 > > > In SimpleDateFormat, 'u' meant day number of week, but in DateTimeFormatter its meaning was changed to year. So we keep the old meaning of 'u' by substituting it with 'e' internally and using DateTimeFormatter to parse the pattern string. Since 'e' and 'c' also represent day-of-week in DateTimeFormatter, we should mark them as illegal pattern characters to stay consistent with the old behavior.
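The substitution described above can be illustrated with a short sketch. The function name is hypothetical and the sketch ignores quoted literals inside patterns, which Spark's real conversion code has to skip; it only shows the rule itself: reject user-facing 'e'/'c', then rewrite legacy 'u' (day-of-week) to DateTimeFormatter's 'e' internally.

```python
def convert_legacy_pattern(pattern):
    # 'e' and 'c' are reserved: the internal rewrite uses them for
    # DateTimeFormatter's day-of-week fields, so user patterns may not.
    for ch in pattern:
        if ch in ("e", "c"):
            raise ValueError(f"Illegal pattern character: '{ch}'")
    # Keep SimpleDateFormat's meaning of 'u' (day-of-week) by mapping it
    # to DateTimeFormatter's 'e'.
    return pattern.replace("u", "e")

print(convert_legacy_pattern("yyyy-MM-dd u"))  # yyyy-MM-dd e
```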
[jira] [Created] (SPARK-31182) PairRDD support aggregateByKeyWithinPartitions
zhengruifeng created SPARK-31182: Summary: PairRDD support aggregateByKeyWithinPartitions Key: SPARK-31182 URL: https://issues.apache.org/jira/browse/SPARK-31182 Project: Spark Issue Type: Improvement Components: ML, Spark Core Affects Versions: 3.1.0 Reporter: zhengruifeng When implementing `RobustScaler`, I was looking for a way to guarantee that the {{QuantileSummaries}} used in {{aggregateByKey}} are compressed before network communication. I only found a tricky workaround (which was not applied in the end); there is no existing method for this.
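The behavior being requested - fold values into per-key buffers inside each partition, then run a compression step on every buffer before it would be shuffled - can be sketched in plain Python. The function name and shape are hypothetical illustrations of the idea, not an existing RDD API:

```python
def aggregate_by_key_within_partition(records, zero, seq_op, compress):
    # Fold each value into a per-key buffer (the within-partition part
    # of aggregateByKey), then compress every buffer before it would
    # leave the partition over the network.
    buffers = {}
    for key, value in records:
        buffers[key] = seq_op(buffers.get(key, zero()), value)
    return {key: compress(buf) for key, buf in buffers.items()}

# Toy usage with integer sums standing in for QuantileSummaries buffers:
partial = aggregate_by_key_within_partition(
    [("a", 1), ("a", 2), ("b", 3)],
    zero=lambda: 0,
    seq_op=lambda acc, v: acc + v,
    compress=lambda buf: buf,  # a real buffer would shrink itself here
)
print(partial)  # {'a': 3, 'b': 3}
```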
[jira] [Created] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases
Dongjoon Hyun created SPARK-31181: - Summary: Remove the default value assumption on CREATE TABLE test cases Key: SPARK-31181 URL: https://issues.apache.org/jira/browse/SPARK-31181 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases
[ https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31181: -- Affects Version/s: (was: 3.0.0) 3.1.0 > Remove the default value assumption on CREATE TABLE test cases > -- > > Key: SPARK-31181 > URL: https://issues.apache.org/jira/browse/SPARK-31181 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests > Affects Versions: 3.1.0 > Reporter: Dongjoon Hyun > Priority: Minor >
[jira] [Comment Edited] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
[ https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058333#comment-17058333 ] Jungtaek Lim edited comment on SPARK-31136 at 3/18/20, 8:14 AM: This reminds me of my previous PR: [https://github.com/apache/spark/pull/27107] Please go through the comments in the PR again. I'm quoting the key point here: {quote}The parts differentiating between two syntaxes are skewSpec, rowFormat, and createFileFormat (using any of them would make create statement go into 2nd syntax), and all of them are optional. We're not enforcing to specify it but rely on the parser. {quote} I think the parser implementation around CREATE TABLE brings ambiguity which is not documented anywhere. It wasn't ambiguous before, because we forced users to specify the USING provider clause if it's not a Hive table. Now it's either the default provider or Hive depending on which options are provided, which seems non-trivial to reason about. (End users would never know, as it comes entirely from the parser rule.) I see this as an issue with "not breaking old behavior". The parser rule has become quite complicated due to supporting the legacy config. Never breaking anything will make us stuck eventually. > Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax > - > > Key: SPARK-31136 > URL: https://issues.apache.org/jira/browse/SPARK-31136 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.0.0 > Reporter: Dongjoon Hyun > Priority: Blocker > Labels: correctness > > We need to consider the behavior change of SPARK-30098. > This is a placeholder to keep the discussion and the final decision. > `CREATE TABLE` syntax changes its behavior silently. > The following is one example of breaking existing user data pipelines. > *Apache Spark 2.4.5* > {code} > spark-sql> CREATE TABLE t(a STRING); > spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t; > spark-sql> SELECT * FROM t LIMIT 1; > # Apache Spark > Time taken: 2.05 seconds, Fetched 1 row(s) > {code} > {code} > spark-sql> CREATE TABLE t(a CHAR(3)); > spark-sql> INSERT INTO TABLE t SELECT 'a '; > spark-sql> SELECT a, length(a) FROM t; > a 3 > {code} > *Apache Spark 3.0.0-preview2* > {code} > spark-sql> CREATE TABLE t(a STRING); > spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t; > Error in query: LOAD DATA is not supported for datasource tables: > `default`.`t`; > {code} > {code} > spark-sql> CREATE TABLE t(a CHAR(3)); > spark-sql> INSERT INTO TABLE t SELECT 'a '; > spark-sql> SELECT a, length(a) FROM t; > a 2 > {code}
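The length difference in the CHAR(3) example above comes from padding semantics: Hive-style CHAR(n) pads (or truncates) values to exactly n characters, while the datasource path in this example stores the string as written. A rough illustration of the two behaviors (not Spark code; the function name is made up):

```python
def store_char(value, n, hive_semantics):
    # Hive CHAR(n): truncate then right-pad to exactly n characters.
    # Datasource table in this example: keep the value as written.
    if hive_semantics:
        return value[:n].ljust(n)
    return value

print(len(store_char("a ", 3, hive_semantics=True)))   # 3, as in Spark 2.4.5
print(len(store_char("a ", 3, hive_semantics=False)))  # 2, as in 3.0.0-preview2
```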
[jira] [Created] (SPARK-31180) Implement PowerTransform
zhengruifeng created SPARK-31180: Summary: Implement PowerTransform Key: SPARK-31180 URL: https://issues.apache.org/jira/browse/SPARK-31180 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
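As a concrete member of that family, here is the Box-Cox transform, one of the transforms such a PowerTransform stage would typically implement (the sketch is illustrative; it is not the proposed ML API):

```python
import math

def box_cox(x, lmbda):
    # Box-Cox power transform, defined for positive x:
    #   (x^lambda - 1) / lambda  for lambda != 0
    #   ln(x)                    for lambda == 0
    if x <= 0:
        raise ValueError("Box-Cox requires positive input")
    if lmbda == 0.0:
        return math.log(x)
    return (x ** lmbda - 1.0) / lmbda

print(box_cox(1.0, 2.0))  # 0.0 (x = 1 maps to 0 for every lambda)
```

The lambda parameter is normally chosen by maximum likelihood to make the transformed data as close to Gaussian as possible.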