[jira] [Created] (SPARK-31186) toPandas fails on simple query (collect() works)

2020-03-18 Thread Michael Chirico (Jira)
Michael Chirico created SPARK-31186:
---

 Summary: toPandas fails on simple query (collect() works)
 Key: SPARK-31186
 URL: https://issues.apache.org/jira/browse/SPARK-31186
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Michael Chirico


My pandas is 0.25.1.

I ran the following simple code (cross joins are enabled):

{code:python}
spark.sql('''
select t1.*, t2.* from (
  select explode(sequence(1, 3)) v
) t1 left join (
  select explode(sequence(1, 3)) v
) t2
''').toPandas()
{code}

and got a ValueError from pandas:

> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), 
> a.item(), a.any() or a.all().

Collect works fine:

{code:python}
spark.sql('''
select * from (
  select explode(sequence(1, 3)) v
) t1 left join (
  select explode(sequence(1, 3)) v
) t2
''').collect()
# [Row(v=1, v=1),
#  Row(v=1, v=2),
#  Row(v=1, v=3),
#  Row(v=2, v=1),
#  Row(v=2, v=2),
#  Row(v=2, v=3),
#  Row(v=3, v=1),
#  Row(v=3, v=2),
#  Row(v=3, v=3)]
{code}

I imagine it's related to the duplicate column names, but this doesn't fail:

{code:python}
spark.sql("select 1 v, 1 v").toPandas()
#    v  v
# 0  1  1
{code}

Also no issue for multiple rows:

spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()

It also works when not using a cross join but a janky programmatically-generated 
union all query:

{code:python}
cond = []
for ii in range(3):
    for jj in range(3):
        cond.append(f'select {ii+1} v, {jj+1} v')
spark.sql(' union all '.join(cond)).toPandas()
{code}

As near as I can tell, the output is identical to the explode output, which makes 
this issue all the more peculiar: I thought toPandas() is applied to the output of 
collect(), so if collect() gives the same output, how can toPandas() fail in one 
case and not the other? Further, the lazy DataFrame is the same in both cases: 
DataFrame[v: int, v: int]. I must be missing something.
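
A hedged workaround sketch (not a fix for the underlying bug): giving the two 
columns distinct aliases before calling toPandas() avoids the duplicate-name code 
path in PySpark's per-column dtype fix-up, which is the likely source of the pandas 
ValueError. The aliases v1/v2 are illustrative.

{code:python}
# Assumption: toPandas() post-processes columns by name; with duplicate names the
# lookup returns a DataFrame instead of a Series, and truth-testing it raises the
# "truth value of a Series is ambiguous" error. Unique aliases sidestep that path.
pdf = spark.sql('''
select t1.v as v1, t2.v as v2 from (
  select explode(sequence(1, 3)) v
) t1 left join (
  select explode(sequence(1, 3)) v
) t2
''').toPandas()
print(pdf.head())
{code}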



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31159) Incompatible Parquet dates/timestamps with Spark 2.4

2020-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31159:

Parent: SPARK-30951
Issue Type: Sub-task  (was: Bug)

> Incompatible Parquet dates/timestamps with Spark 2.4
> 
>
> Key: SPARK-31159
> URL: https://issues.apache.org/jira/browse/SPARK-31159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Write dates/timestamps to Parquet file in Spark 2.4:
> {code}
> $ export TZ="UTC"
> $ ~/spark-2.4/bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>       /_/
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_231)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
> scala> val df = Seq(("1001-01-01", "1001-01-01 
> 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), 
> $"tsS".cast("timestamp").as("ts"))
> df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
> scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
> scala> spark.conf.set("spark.sql.parquet.outputTimestampType", 
> "TIMESTAMP_MICROS")
> scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
> scala> 
> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
> +----------+--------------------------+
> |d         |ts                        |
> +----------+--------------------------+
> |1001-01-01|1001-01-01 01:02:03.123456|
> +----------+--------------------------+
> {code}
> Spark 2.4 saves dates/timestamps in Julian calendar. The parquet-mr tool 
> prints *1001-01-07* and *1001-01-07T01:02:03.123456+*:
> {code}
> $ java -jar 
> /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar
>  dump -m 
> ./2_4_5_micros/part-0-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet
> INT32 d
> 
> *** row group 1 of 1, values 1 to 1 ***
> value 1: R:0 D:1 V:1001-01-07
> INT64 ts
> 
> *** row group 1 of 1, values 1 to 1 ***
> value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+
> {code}
> Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) prints the same as parquet-mr but 
> different values from Spark 2.4:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
>       /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_231)
> scala> 
> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
> +----------+--------------------------+
> |d         |ts                        |
> +----------+--------------------------+
> |1001-01-07|1001-01-07 01:02:03.123456|
> +----------+--------------------------+
> {code}
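
The 6-day shift in the output above (1001-01-01 vs. 1001-01-07) is the difference 
between the Julian and proleptic Gregorian calendars around the year 1001. A small, 
framework-free sketch of that arithmetic using the standard Julian Day Number 
formulas (illustration only, not Spark's conversion code):

{code:python}
def gregorian_jdn(y, m, d):
    # Julian Day Number of a proleptic Gregorian calendar date
    a = (14 - m) // 12
    y, m = y + 4800 - a, m + 12 * a - 3
    return d + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045

def julian_jdn(y, m, d):
    # Julian Day Number of a Julian calendar date
    a = (14 - m) // 12
    y, m = y + 4800 - a, m + 12 * a - 3
    return d + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

# The same physical day is labelled 1001-01-01 in the Julian calendar and
# 1001-01-07 in the proleptic Gregorian calendar, hence the 6-day shift.
print(julian_jdn(1001, 1, 1) == gregorian_jdn(1001, 1, 7))    # True
print(gregorian_jdn(1001, 1, 7) - gregorian_jdn(1001, 1, 1))  # 6
{code}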



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4

2020-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31183:

Parent: SPARK-30951
Issue Type: Sub-task  (was: Bug)

> Incompatible Avro dates/timestamps with Spark 2.4
> -
>
> Key: SPARK-31183
> URL: https://issues.apache.org/jira/browse/SPARK-31183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Write dates/timestamps to Avro file in Spark 2.4.5:
> {code}
> $ export TZ="America/Los_Angeles"
> $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5
> {code}
> {code:scala}
> scala> 
> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-01|
> +----------+
> scala> 
> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +--------------------------+
> |ts                        |
> +--------------------------+
> |1001-01-01 01:02:03.123456|
> +--------------------------+
> {code}
> Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from 
> Spark 2.4.5:
> {code}
> $ export TZ="America/Los_Angeles"
> $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5
> {code}
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-07|
> +----------+
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +--------------------------+
> |ts                        |
> +--------------------------+
> |1001-01-07 01:09:05.123456|
> +--------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time

2020-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31076:

Parent: SPARK-30951
Issue Type: Sub-task  (was: Improvement)

> Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
> 
>
> Key: SPARK-31076
> URL: https://issues.apache.org/jira/browse/SPARK-31076
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> By default, collect() returns java.sql.Timestamp/Date instances with offsets 
> derived from the internal values of Catalyst's TIMESTAMP/DATE, which store 
> microseconds since the epoch. The conversion from internal values to 
> java.sql.Timestamp/Date is based on the Proleptic Gregorian calendar, but 
> converting the resulting values before the year 1582 to strings produces 
> timestamp/date strings in the Julian calendar. For example:
> {code}
> scala> sql("select date '1100-10-10'").collect()
> res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
> {code} 
> This can be fixed by converting Catalyst's internal values to a local date-time 
> in the Gregorian calendar, and then constructing the java.sql.Timestamp/Date from 
> the resulting year, month, ..., seconds interpreted in the Julian calendar.
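
A quick, framework-free check of the 7-day shift in the example above, using the 
standard Julian Day Number formulas (this illustrates the calendar difference only, 
not the proposed fix):

{code:python}
def gregorian_jdn(y, m, d):
    a = (14 - m) // 12
    y, m = y + 4800 - a, m + 12 * a - 3
    return d + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045

def julian_jdn(y, m, d):
    a = (14 - m) // 12
    y, m = y + 4800 - a, m + 12 * a - 3
    return d + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

# The day count meaning 1100-10-10 in the proleptic Gregorian calendar is labelled
# 1100-10-03 in the Julian calendar, which matches the collect() output above.
print(gregorian_jdn(1100, 10, 10) == julian_jdn(1100, 10, 3))  # True
{code}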



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28626) Spark leaves unencrypted data on local disk, even with encryption turned on (CVE-2019-10099)

2020-03-18 Thread Wing Yew Poon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062230#comment-17062230
 ] 

Wing Yew Poon commented on SPARK-28626:
---

For the record, to assist folks who need to backport this:
From branch-2.3, we also need 
[https://github.com/apache/spark/commit/323dc3ad02e63a7c99b5bd6da618d6020657ecba]
[PYSPARK] Update py4j to version 0.10.7.
For the SPARKR change, there is a preceding change that is needed:
[https://github.com/apache/spark/commit/dad5c48b2a229bf6f9e6b8548f9335f04a15c818]
[MINOR][PYTHON] Use a helper in `PythonUtils` instead of direct accessing Scala 
package


> Spark leaves unencrypted data on local disk, even with encryption turned on 
> (CVE-2019-10099)
> 
>
> Key: SPARK-28626
> URL: https://issues.apache.org/jira/browse/SPARK-28626
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.3.2
>Reporter: Imran Rashid
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>
> Severity: Important
>  
> Vendor: The Apache Software Foundation
>  
> Versions affected:
> All Spark 1.x, Spark 2.0.x, Spark 2.1.x, and 2.2.x versions
> Spark 2.3.0 to 2.3.2
>  
> Description:
> Prior to Spark 2.3.3, in certain situations Spark would write user data to 
> local disk unencrypted, even if spark.io.encryption.enabled=true.  This 
> includes cached blocks that are fetched to disk (controlled by 
> spark.maxRemoteBlockSizeFetchToMem); in SparkR, using parallelize; in 
> Pyspark, using broadcast and parallelize; and use of python udfs.
>  
>  
> Mitigation:
> 1.x, 2.0.x, 2.1.x, 2.2.x, 2.3.x  users should upgrade to 2.3.3 or newer, 
> including 2.4.x
>  
> Credit:
> This issue was reported by Thomas Graves of NVIDIA.
>  
> References:
> [https://spark.apache.org/security.html]
>  
> The following commits were used to fix this issue, in branch-2.3 (there may 
> be other commits in master / branch-2.4, that are equivalent.)
> {noformat}
> commit 575fea120e25249716e3f680396580c5f9e26b5b
> Author: Imran Rashid 
> Date:   Wed Aug 22 16:38:28 2018 -0500
>     [CORE] Updates to remote cache reads
>     Covered by tests in DistributedSuite
>  
> commit 6d742d1bd71aa3803dce91a830b37284cb18cf70
> Author: Imran Rashid 
> Date:   Thu Sep 6 12:11:47 2018 -0500
>     [PYSPARK][SQL] Updates to RowQueue
>     Tested with updates to RowQueueSuite
>  
> commit 09dd34cb1706f2477a89174d6a1a0f17ed5b0a65
> Author: Imran Rashid 
> Date:   Mon Aug 13 21:35:34 2018 -0500 
>     [PYSPARK] Updates to pyspark broadcast
>  
> commit 12717ba0edfa5459c9ac2085f46b1ecc0ee759aa
> Author: hyukjinkwon 
> Date:   Mon Sep 24 19:25:02 2018 +0800 
>     [SPARKR] Match pyspark features in SparkR communication protocol
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27941) Serverless Spark in the Cloud

2020-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062198#comment-17062198
 ] 

Dongjoon Hyun commented on SPARK-27941:
---

Hi, [~mu5358271].
Is there any update on this issue?

> Serverless Spark in the Cloud
> -
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner 
> and remove the need to provision, maintain and manage a cluster. POC: 
> [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively, it would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud friendly so that it is easier for various cloud providers to integrate. 
> For example, 
>  * authentication: IO and network encryption requires authentication via 
> securely sharing a secret, and the implementation of this is currently tied 
> to the cluster manager: YARN uses Hadoop UGI, Kubernetes uses a shared file 
> mounted on all pods. These can be decoupled so that it is possible to swap in 
> an implementation backed by a public cloud. In the POC, this is implemented by 
> passing around an AWS KMS-encrypted secret and decrypting it at each executor, 
> which delegates authentication and authorization to the cloud (see the sketch 
> after this list).
>  * deployment & scheduler: adding a new cluster manager and scheduler backend 
> requires changing a number of places in the Spark core package and rebuilding 
> the entire project. Having a pluggable scheduler per 
> https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
> different scheduler backends backed by different cloud providers.
>  * client-cluster communication: I am not very familiar with the network part 
> of the code base so I might be wrong on this. My understanding is that the 
> code base assumes that the client and the cluster are on the same network and 
> the nodes communicate with each other via hostname/ip. For security best 
> practice, it is advised to run the executors in a private protected network, 
> which may be separate from the client machine's network. Since we are 
> serverless, that means the client needs to first launch the driver into the 
> private network, and the driver in turn starts the executors, potentially 
> doubling job initialization time. This can be solved by dropping complete 
> serverlessness and having a persistent host in the private network, or (I do 
> not have a POC, so I am not sure if this actually works) by implementing 
> client-cluster communication via message queues in the cloud to get around 
> the network separation.
>  * shuffle storage and retrieval: external shuffle in yarn relies on the 
> existence of a persistent cluster that continues to serve shuffle files 
> beyond the lifecycle of the executors. This assumption no longer holds in a 
> serverless cluster with only transient containers. Pluggable remote shuffle 
> storage per https://issues.apache.org/jira/browse/SPARK-25299 would make it 
> easier to introduce new cloud-backed shuffle.
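
A rough sketch of the authentication idea from the first bullet (the key id is 
hypothetical; this is not the POC's actual code): the submitter encrypts the shared 
secret with AWS KMS, ships only the ciphertext, and each executor decrypts it, so 
authentication and authorization are delegated to the cloud's IAM policies.

{code:python}
import boto3

kms = boto3.client("kms")

def encrypt_secret(plaintext: bytes, key_id: str) -> bytes:
    # run by the submitter; only the ciphertext is passed around
    return kms.encrypt(KeyId=key_id, Plaintext=plaintext)["CiphertextBlob"]

def decrypt_secret(ciphertext: bytes) -> bytes:
    # run on each executor; succeeds only if its IAM role may use the key
    return kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
{code}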



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts

2020-03-18 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-31178.
-
Fix Version/s: 3.0.0
 Assignee: Burak Yavuz
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/27941]

> sql("INSERT INTO v2DataSource ...").collect() double inserts
> 
>
> Key: SPARK-31178
> URL: https://issues.apache.org/jira/browse/SPARK-31178
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Blocker
> Fix For: 3.0.0
>
>
> The following unit test fails in DataSourceV2SQLSuite:
> {code:java}
> test("do not double insert on INSERT INTO collect()") {
>   import testImplicits._
>   val t1 = s"${catalogAndNamespace}tbl"
>   sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
>   val tmpView = "test_data"
>   val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data")
>   df.createOrReplaceTempView(tmpView)
>   sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect()
>   verifyTable(t1, df)
> } {code}
> The INSERT INTO is double-inserting when .collect() is called. I think this 
> is because the V2 SparkPlans are not commands, and doExecute on a Spark plan 
> can be called multiple times, causing data to be inserted multiple times.
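
A minimal PySpark sketch of the check the quoted test performs (the catalog and 
table name testcat.ns1.tbl are hypothetical): after INSERT INTO ... followed by 
.collect(), the target table should contain the source rows exactly once.

{code:python}
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "data"])
df.createOrReplaceTempView("test_data")
spark.sql("INSERT INTO TABLE testcat.ns1.tbl SELECT * FROM test_data").collect()
# With the bug described above, the count comes back as 6 instead of 3.
assert spark.table("testcat.ns1.tbl").count() == 3
{code}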



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31185) implement VarianceThresholdSelector

2020-03-18 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31185:
--

 Summary: implement VarianceThresholdSelector
 Key: SPARK-31185
 URL: https://issues.apache.org/jira/browse/SPARK-31185
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Implement a feature selector that removes all low-variance features. Features 
with a variance lower than the threshold will be removed. The default is to keep 
all features with non-zero variance, i.e. remove the features that have the same 
value in all samples.
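
A minimal numpy sketch of the selection rule described above (illustration only, 
not the proposed Spark ML API): keep feature j iff Var(X[:, j]) > threshold, so the 
default threshold of 0.0 drops exactly the constant features.

{code:python}
import numpy as np

X = np.array([[1.0, 2.0, 0.0],
              [1.0, 4.0, 0.0],
              [1.0, 6.0, 0.0]])
threshold = 0.0
variances = X.var(axis=0)
kept = np.where(variances > threshold)[0]
print(variances)  # [0.         2.66666667 0.        ]
print(kept)       # [1] -> only the non-constant middle feature survives
{code}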



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30933) ML, GraphX 3.0 QA: Update user guide for new features & APIs

2020-03-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30933.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27880
[https://github.com/apache/spark/pull/27880]

> ML, GraphX 3.0 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-30933
> URL: https://issues.apache.org/jira/browse/SPARK-30933
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.0.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
>  * Create a JIRA for that feature, and assign it to the author of the feature
>  * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
>  * This task does not include major reorganizations for the programming guide.
>  * We should now begin copying algorithm details from the spark.mllib guide 
> to spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30933) ML, GraphX 3.0 QA: Update user guide for new features & APIs

2020-03-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30933:


Assignee: Huaxin Gao  (was: zhengruifeng)

> ML, GraphX 3.0 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-30933
> URL: https://issues.apache.org/jira/browse/SPARK-30933
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
>  * Create a JIRA for that feature, and assign it to the author of the feature
>  * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
>  * This task does not include major reorganizations for the programming guide.
>  * We should now begin copying algorithm details from the spark.mllib guide 
> to spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30933) ML, GraphX 3.0 QA: Update user guide for new features & APIs

2020-03-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30933:


Assignee: zhengruifeng

> ML, GraphX 3.0 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-30933
> URL: https://issues.apache.org/jira/browse/SPARK-30933
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
>  * Create a JIRA for that feature, and assign it to the author of the feature
>  * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
>  * This task does not include major reorganizations for the programming guide.
>  * We should now begin copying algorithm details from the spark.mllib guide 
> to spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31184) Support getTablesByType API of Hive Client

2020-03-18 Thread Xin Wu (Jira)
Xin Wu created SPARK-31184:
--

 Summary: Support getTablesByType API of Hive Client
 Key: SPARK-31184
 URL: https://issues.apache.org/jira/browse/SPARK-31184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu


Hive 2.3+ supports the getTablesByType API, which is a precondition for 
implementing SHOW VIEWS in HiveExternalCatalog. Currently, without this API, we 
cannot get Hive tables with type HiveTableType.VIRTUAL_VIEW directly.
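
A small PySpark sketch of the behavior this API would enable, i.e. listing only 
views. Without a server-side getTablesByType, the filtering has to happen client 
side over all tables (illustration only, not the proposed HiveExternalCatalog 
change):

{code:python}
# list only the views in a database by filtering the catalog's table listing
views = [t.name for t in spark.catalog.listTables("default") if t.tableType == "VIEW"]
print(views)
{code}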



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2020-03-18 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061916#comment-17061916
 ] 

Reynold Xin commented on SPARK-25728:
-

It's too big of a change, and realistically speaking there are probably only a few 
people in the world who could do this well. I'm going to close it.

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate 
> representation for generating Java code from a program using the DataFrame or 
> Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a 
> thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].
> Please feel free to comment on this JIRA entry or [Google 
> Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
>  too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2020-03-18 Thread Reynold Xin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-25728.
-
Resolution: Won't Fix

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate 
> representation for generating Java code from a program using the DataFrame or 
> Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a 
> thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].
> Please feel free to comment on this JIRA entry or [Google 
> Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
>  too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2020-03-18 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061914#comment-17061914
 ] 

Kazuaki Ishizaki commented on SPARK-25728:
--

For now, no update on my side.

I am happy if the community or PMCs want to revitalize it. Otherwise, should I 
close this?

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate 
> representation for generating Java code from a program using the DataFrame or 
> Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a 
> thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].
> Please feel free to comment on this JIRA entry or [Google 
> Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
>  too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4

2020-03-18 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061902#comment-17061902
 ] 

Maxim Gekk commented on SPARK-31183:


I am working on the issue.

> Incompatible Avro dates/timestamps with Spark 2.4
> -
>
> Key: SPARK-31183
> URL: https://issues.apache.org/jira/browse/SPARK-31183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Write dates/timestamps to Avro file in Spark 2.4.5:
> {code}
> $ export TZ="America/Los_Angeles"
> $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5
> {code}
> {code:scala}
> scala> 
> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-01|
> +----------+
> scala> 
> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +--------------------------+
> |ts                        |
> +--------------------------+
> |1001-01-01 01:02:03.123456|
> +--------------------------+
> {code}
> Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from 
> Spark 2.4.5:
> {code}
> $ export TZ="America/Los_Angeles"
> $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5
> {code}
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-07|
> +----------+
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +--------------------------+
> |ts                        |
> +--------------------------+
> |1001-01-07 01:09:05.123456|
> +--------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4

2020-03-18 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061903#comment-17061903
 ] 

Maxim Gekk commented on SPARK-31183:


[~cloud_fan] FYI

> Incompatible Avro dates/timestamps with Spark 2.4
> -
>
> Key: SPARK-31183
> URL: https://issues.apache.org/jira/browse/SPARK-31183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Write dates/timestamps to Avro file in Spark 2.4.5:
> {code}
> $ export TZ="America/Los_Angeles"
> $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5
> {code}
> {code:scala}
> scala> 
> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-01|
> +----------+
> scala> 
> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +--------------------------+
> |ts                        |
> +--------------------------+
> |1001-01-01 01:02:03.123456|
> +--------------------------+
> {code}
> Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from 
> Spark 2.4.5:
> {code}
> $ export TZ="America/Los_Angeles"
> $ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5
> {code}
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
> +----------+
> |date      |
> +----------+
> |1001-01-07|
> +----------+
> scala> 
> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
> +--------------------------+
> |ts                        |
> +--------------------------+
> |1001-01-07 01:09:05.123456|
> +--------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31183) Incompatible Avro dates/timestamps with Spark 2.4

2020-03-18 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31183:
--

 Summary: Incompatible Avro dates/timestamps with Spark 2.4
 Key: SPARK-31183
 URL: https://issues.apache.org/jira/browse/SPARK-31183
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Write dates/timestamps to Avro file in Spark 2.4.5:
{code}
$ export TZ="America/Los_Angeles"
$ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5
{code}
{code:scala}
scala> 
df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")

scala> 
spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-01|
+----------+


scala> 
df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")

scala> 
spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-01 01:02:03.123456|
+--------------------------+
{code}

Spark 3.0.0-preview2 ( and 3.1.0-SNAPSHOT) outputs different values from Spark 
2.4.5:
{code}
$ export TZ="America/Los_Angeles"
$ /bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5
{code}
{code:scala}
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> 
spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
+----------+

scala> 
spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_ts_avro").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-07 01:09:05.123456|
+--------------------------+
{code}
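
For the timestamp, the shift has two parts: the 6-day Julian vs. proleptic 
Gregorian calendar difference (as in SPARK-31159) plus an extra 7 minutes 2 seconds. 
The latter matches the gap between America/Los_Angeles local mean time (-07:52:58) 
and -08:00. A small sketch to inspect that offset from the tz database (Python 3.9+ 
zoneinfo; this illustrates the time zone data, not Spark's conversion code):

{code:python}
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

la = ZoneInfo("America/Los_Angeles")
# For dates long before standard time was adopted, the tz database uses local
# mean time (LMT) for this zone; -07:52:58 differs from -08:00 by the same
# 7 min 2 s seen between 01:02:03 and 01:09:05 above.
print(datetime(1001, 1, 7, 1, 2, 3, tzinfo=la).utcoffset())  # -1 day, 16:07:02 == -07:52:58
{code}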



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31175) Avoid creating reverse comparator for each compare in InterpretedOrdering

2020-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31175:
---

Assignee: wuyi

> Avoid creating reverse comparator for each compare in InterpretedOrdering
> -
>
> Key: SPARK-31175
> URL: https://issues.apache.org/jira/browse/SPARK-31175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Currently, we'll create a new reverse comparator for each compare in 
> InterpretedOrdering, which could generate lots of small, short-lived objects 
> and put pressure on the JVM.
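
A language-agnostic sketch of the shape of the fix (illustrative Python, not 
Spark's Scala InterpretedOrdering): build the descending comparator once per 
ordering instead of once per compare() call, so the hot comparison path allocates 
nothing.

{code:python}
class Ascending:
    def compare(self, a, b):
        return (a > b) - (a < b)

class DescendingPerCall:
    def compare(self, a, b):
        # allocates a throwaway comparator on every single compare
        return Ascending().compare(b, a)

class DescendingCached:
    def __init__(self):
        self._asc = Ascending()  # built once, reused by every compare
    def compare(self, a, b):
        return self._asc.compare(b, a)
{code}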



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31175) Avoid creating reverse comparator for each compare in InterpretedOrdering

2020-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31175.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27938
[https://github.com/apache/spark/pull/27938]

> Avoid creating reverse comparator for each compare in InterpretedOrdering
> -
>
> Key: SPARK-31175
> URL: https://issues.apache.org/jira/browse/SPARK-31175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, we'll create a new reverse comparator for each compare in 
> InterpretedOrdering, which could generate lots of small, short-lived objects 
> and put pressure on the JVM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30295) Remove Hive dependencies from SparkSQLCLI

2020-03-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30295.
--
Resolution: Won't Fix

> Remove Hive dependencies from SparkSQLCLI
> -
>
> Key: SPARK-30295
> URL: https://issues.apache.org/jira/browse/SPARK-30295
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Javier Fuentes
>Priority: Major
>
> Removal of unnecessary hive dependencies for the Spark SQL Client. Replacing 
> that with a native Scala implementation. For the client driver, argument 
> parser and SparkSqlCliDriver.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31165) Multiple wrong references in Dockerfile for k8s

2020-03-18 Thread Nikolay Dimolarov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolay Dimolarov updated SPARK-31165:
--
Description: 
I am currently trying to follow the k8s instructions for Spark: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html] and when I 
clone apache/spark on GitHub on the master branch I saw multiple wrong folder 
references after trying to build my Docker image:

 

*Issue 1: The comments in the Dockerfile reference the wrong folder for the 
Dockerfile:*
{code:java}
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .{code}
Well that docker build command simply won't run. I only got the following to 
run:
{code:java}
docker build -t spark:latest -f 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile . 
{code}
which is the actual path to the Dockerfile.

 

*Issue 2: jars folder does not exist*

After I read the tutorial I of course build spark first as per the instructions 
with:
{code:java}
./build/mvn -Pkubernetes -DskipTests clean package{code}
Nonetheless, in the Dockerfile I get this error when building:
{code:java}
Step 5/18 : COPY jars /opt/spark/jars
COPY failed: stat /var/lib/docker/tmp/docker-builder402673637/jars: no such 
file or directory{code}
 for which I may have found a similar issue here: 
[https://stackoverflow.com/questions/52451538/spark-for-kubernetes-test-on-mac]

I am new to Spark, but I assumed that this jars folder would exist in the root 
folder of the project if the build step actually produced it (and the Maven build 
of the master branch did run successfully with the command mentioned above). It 
turns out it's here:

spark/assembly/target/scala-2.12/jars

 

*Issue 3: missing entrypoint.sh and decom.sh due to wrong reference*

While Issue 2 remains unresolved as I can't wrap my head around the missing 
jars folder (bin and sbin got copied successfully after I made a dummy jars 
folder) I then got stuck on these 2 steps:
{code:java}
COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY kubernetes/dockerfiles/spark/decom.sh /opt/{code}
 
 with:
  
{code:java}
Step 8/18 : COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY failed: stat 
/var/lib/docker/tmp/docker-builder638219776/kubernetes/dockerfiles/spark/entrypoint.sh:
 no such file or directory{code}
 
 which makes sense since the path should actually be:
  
 resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh
 resource-managers/kubernetes/docker/src/main/dockerfiles/spark/decom.sh
  
 *Remark*
  
 I only created one issue since this seems like somebody cleaned up the repo 
and forgot to change these. Am I missing something here? If I am, I apologise 
in advance since I am new to the Spark project. I also saw that some of these 
references were handled through vars in previous branches: 
[https://github.com/apache/spark/blob/branch-2.4/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile]
 (e.g. 2.4) but that also does not run out of the box.
  
 I am also really not sure about the affected versions since that was not 
transparent enough for me on GH - feel free to edit that field :) 
  
 Thanks in advance!
  
  
  

  was:
I am currently trying to follow the k8s instructions for Spark: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html] and when I 
clone apache/spark on GitHub on the master branch I saw multiple wrong folder 
references after trying to build my Docker image:

 

*Issue 1: The comments in the Dockerfile reference the wrong folder for the 
Dockerfile:*
{code:java}
# If this docker file is being used in the context of building your images from 
a Spark # distribution, the docker build command should be invoked from the top 
level directory # of the Spark distribution. E.g.: # docker build -t 
spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .{code}
Well that docker build command simply won't run. I only got the following to 
run:
{code:java}
docker build -t spark:latest -f 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile . 
{code}
which is the actual path to the Dockerfile.

 

*Issue 2: jars folder does not exist*

After I read the tutorial I of course build spark first as per the instructions 
with:
{code:java}
./build/mvn -Pkubernetes -DskipTests clean package{code}
Nonetheless, in the Dockerfile I get this error when building:
{code:java}
Step 5/18 : COPY jars /opt/spark/jars
COPY failed: stat /var/lib/docker/tmp/docker-builder402673637/jars: no such 
file or directory{code}
 for which I may have found a similar issue here: 
[https://stackoverflow.com/questions/52451538/spark-for-kubernetes-test-on-mac]

I am new 

[jira] [Assigned] (SPARK-31176) Remove support for 'e'/'c' as datetime pattern charactar

2020-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31176:
---

Assignee: Kent Yao

> Remove support for 'e'/'c' as datetime pattern charactar 
> -
>
> Key: SPARK-31176
> URL: https://issues.apache.org/jira/browse/SPARK-31176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> In SimpleDateFormat, 'u' meant the day number of the week, but it was changed 
> to mean the year in DateTimeFormatter. So we keep the old meaning of 'u' by 
> substituting 'u' with 'e' internally and using DateTimeFormatter to parse the 
> pattern string. In DateTimeFormatter, 'e' and 'c' also represent day-of-week, 
> so we should mark them as illegal pattern characters to stay consistent with 
> the previous behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31176) Remove support for 'e'/'c' as datetime pattern charactar

2020-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31176.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27939
[https://github.com/apache/spark/pull/27939]

> Remove support for 'e'/'c' as datetime pattern charactar 
> -
>
> Key: SPARK-31176
> URL: https://issues.apache.org/jira/browse/SPARK-31176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> In SimpleDateFormat, 'u' meant the day number of the week, but it was changed 
> to mean the year in DateTimeFormatter. So we keep the old meaning of 'u' by 
> substituting 'u' with 'e' internally and using DateTimeFormatter to parse the 
> pattern string. In DateTimeFormatter, 'e' and 'c' also represent day-of-week, 
> so we should mark them as illegal pattern characters to stay consistent with 
> the previous behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31182) PairRDD support aggregateByKeyWithinPartitions

2020-03-18 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-31182:


 Summary: PairRDD support aggregateByKeyWithinPartitions
 Key: SPARK-31182
 URL: https://issues.apache.org/jira/browse/SPARK-31182
 Project: Spark
  Issue Type: Improvement
  Components: ML, Spark Core
Affects Versions: 3.1.0
Reporter: zhengruifeng


When implementing `RobustScaler`, I was looking for a way to guarantee that the 
{{QuantileSummaries}} used in {{aggregateByKey}} are compressed before network 
communication.

I only found a tricky workaround (which was not applied in the end); there is no 
method for this.
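
A rough PySpark sketch of the kind of guarantee being asked for (the helper below 
is hypothetical, and counting stands in for merging quantile summaries): aggregate 
each partition to completion with mapPartitions, so values are merged (and could be 
compressed) locally before anything crosses the network, then combine the 
per-partition results by key.

{code:python}
def aggregate_by_key_within_partitions(pair_rdd, make_zero, seq_op, comb_op):
    def per_partition(items):
        acc = {}
        for k, v in items:
            acc[k] = seq_op(acc.get(k, make_zero()), v)
        # everything here is still partition-local; a compress() hook could be
        # applied to each value before it is shuffled
        return iter(acc.items())
    return pair_rdd.mapPartitions(per_partition).reduceByKey(comb_op)

# usage (sc is an active SparkContext):
rdd = sc.parallelize([("a", 1), ("a", 1), ("b", 1)], 2)
print(aggregate_by_key_within_partitions(
    rdd, lambda: 0, lambda acc, v: acc + v, lambda x, y: x + y).collect())
{code}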



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases

2020-03-18 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31181:
-

 Summary: Remove the default value assumption on CREATE TABLE test 
cases
 Key: SPARK-31181
 URL: https://issues.apache.org/jira/browse/SPARK-31181
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31181) Remove the default value assumption on CREATE TABLE test cases

2020-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31181:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Remove the default value assumption on CREATE TABLE test cases
> --
>
> Key: SPARK-31181
> URL: https://issues.apache.org/jira/browse/SPARK-31181
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31136) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-03-18 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058333#comment-17058333
 ] 

Jungtaek Lim edited comment on SPARK-31136 at 3/18/20, 8:14 AM:


This reminds me about my previous PR:

[https://github.com/apache/spark/pull/27107]

Please go through the comments in the PR again. I'm quoting the key point here:
{quote}The parts differentiating between two syntaxes are skewSpec, rowFormat, 
and createFileFormat (using any of them would make create statement go into 2nd 
syntax), and all of them are optional. We're not enforcing to specify it but 
rely on the parser.
{quote}
I think the parser implementation around CREATE TABLE brings ambiguity which is 
not documented anywhere. It wasn't ambiguous before because we forced users to 
specify the USING provider if it's not a Hive table. Now it's either the default 
provider or Hive depending on which options are provided, which seems non-trivial 
to reason about. (End users would never know, as it's decided entirely by the 
parser rule.)

I see this as an issue of "not breaking old behavior". The parser rule gets 
pretty complicated due to supporting the legacy config. Not breaking anything 
would eventually leave us stuck.


was (Author: kabhwan):
This reminds me about my previous PR:

[https://github.com/apache/spark/pull/27107]

Please go through the comments in the PR again. I'm quoting the key point here:
{quote}The parts differentiating between two syntaxes are skewSpec, rowFormat, 
and createFileFormat (using any of them would make create statement go into 2nd 
syntax), and all of them are optional. We're not enforcing to specify it but 
rely on the parser.
{quote}
I think the parser implementation around CREATE TABLE brings ambiguity which is 
not documented anywhere. It wasn't ambiguous because we forced to specify 
STORED AS if it's not a Hive table. Now it's either default provider or Hive 
according to which options are provided, which seems to be non-trivial to 
reason about. (End users would never know, as it's completely from parser rule.)

I feel this as the issue of "not breaking old behavior". The parser rule gets 
pretty much complicated due to support legacy config. Not breaking anything 
would make us be stuck eventually.

> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -
>
> Key: SPARK-31136
> URL: https://issues.apache.org/jira/browse/SPARK-31136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> We need to consider the behavior change of SPARK-30098 .
> This is a placeholder to keep the discussion and the final decision.
> `CREATE TABLE` syntax changes its behavior silently.
> The following is one example of the breaking the existing user data pipelines.
> *Apache Spark 2.4.5*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> spark-sql> SELECT * FROM t LIMIT 1;
> # Apache Spark
> Time taken: 2.05 seconds, Fetched 1 row(s)
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 3
> {code}
> *Apache Spark 3.0.0-preview2*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> Error in query: LOAD DATA is not supported for datasource tables: 
> `default`.`t`;
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31180) Implement PowerTransform

2020-03-18 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-31180:


 Summary: Implement PowerTransform
 Key: SPARK-31180
 URL: https://issues.apache.org/jira/browse/SPARK-31180
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: zhengruifeng


Power transforms are a family of parametric, monotonic transformations that are 
applied to make data more Gaussian-like. This is useful for modeling issues 
related to heteroscedasticity (non-constant variance), or other situations where 
normality is desired.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org