[jira] [Commented] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080265#comment-17080265 ] Reynold Xin commented on SPARK-31399: - This is bad... [~sowen] and [~joshrosen] did you look into this in the past? > closure cleaner is broken in Spark 3.0 > -- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > The `ClosureCleaner` only supports Scala functions, and it uses the following > check to catch closures: > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This no longer works in 3.0, since we upgraded to Scala 2.12 and most Scala > functions became Java lambdas. > As an example, the following code works well in the Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But it fails in 3.0: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
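The `$Lambda$2438/170049100` frame in the stack trace above shows exactly why the heuristic misses: Scala 2.12 closures no longer carry `$anonfun$` in their class names. Below is a minimal Python sketch of the check's failure mode (the real check is Scala code in ClosureCleaner.scala; the class names used here are taken from this report, not invented):

```python
# Sketch (Python analogy, not Spark code) of the name-based closure check
# quoted above, and why it breaks. Scala 2.11 emitted "$anonfun$" classes;
# Scala 2.12 emits invokedynamic Java lambdas with JVM-generated
# "$Lambda$..." names, as seen in the stack trace.
def is_closure(cls_name: str) -> bool:
    # Spark 2.x heuristic: a class is a Scala closure iff its name
    # contains "$anonfun$".
    return "$anonfun$" in cls_name

# Scala 2.11 compiles { _ => Foo("") } to a class named like this -> detected:
assert is_closure("MyJob$$anonfun$1")
# Scala 2.12 compiles the same closure to a Java lambda named like this ->
# missed, so the closure is never cleaned and it drags the captured,
# non-serializable Column along at serialization time:
assert not is_closure("$Lambda$2438/170049100")
```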
[jira] [Updated] (SPARK-31402) Incorrect rebasing of BCE dates
[ https://issues.apache.org/jira/browse/SPARK-31402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31402: Parent: SPARK-31404 Issue Type: Sub-task (was: Bug) > Incorrect rebasing of BCE dates > --- > > Key: SPARK-31402 > URL: https://issues.apache.org/jira/browse/SPARK-31402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Dates before the common era are rebased incorrectly, see > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120679/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/ > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > postgreSQL/date.sql > Expected "[-0044]-03-15", but got "[0045]-03-15" Result did not match for > query #93 > select make_date(-44, 3, 15) > {code} > Even though such dates are out of the valid range of dates supported by the DATE > type, there is a test in postgreSQL/date.sql for a negative year, and it > would be nice to fix the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
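The expected `-0044` versus the actual `0045` in the failure above is the classic off-by-one between astronomical year numbering (which has a year 0, so astronomical year -44 is 45 BCE) and era-based year counts. A plain-Python sketch of that relationship (not Spark code; the digits only suggest this is the cause of the bug):

```python
def astro_to_era(year):
    # ISO 8601 / astronomical numbering has a year 0 (= 1 BCE),
    # so astronomical year -44 is 45 BCE, not 44 BCE.
    if year > 0:
        return (year, "CE")
    return (1 - year, "BCE")

# The buggy output "0045-03-15" looks like the era year (45 BCE) printed
# without its sign/era marker, instead of the astronomical year -0044.
assert astro_to_era(-44) == (45, "BCE")
assert astro_to_era(0) == (1, "BCE")
assert astro_to_era(2020) == (2020, "CE")
```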
[jira] [Comment Edited] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
[ https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080257#comment-17080257 ] JinxinTang edited comment on SPARK-31386 at 4/10/20, 5:32 AM: -- Or could you please try Spark 2.4.5? No problem could be reproduced there either. [download link|https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz] was (Author: jinxintang): or could you please try the spark 2.4.5, also no problem could be reproduces [download link|https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz] > Reading broadcast in UDF raises MemoryError when > spark.executor.pyspark.memory is set > - > > Key: SPARK-31386 > URL: https://issues.apache.org/jira/browse/SPARK-31386 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 > Environment: Spark 2.4.4 or AWS EMR > `pyspark --conf spark.executor.pyspark.memory=500m` >Reporter: Viacheslav Krot >Priority: Major > Attachments: 选区_267.png > > > The following code with a udf causes a MemoryError when > `spark.executor.pyspark.memory` is set: > ``` > from pyspark.sql.types import BooleanType > from pyspark.sql.functions import udf > df = spark.createDataFrame([ > ('Alice', 10), > ('Bob', 12) > ], ['name', 'cnt']) > broadcast = spark.sparkContext.broadcast([1,2,3]) > @udf(BooleanType()) > def f(cnt): > return cnt < len(broadcast.value) > df.filter(f(df.cnt)).count() > ``` > The same code works well when spark.executor.pyspark.memory is not set. > The code by itself does not make any sense; it is just the simplest code that reproduces > the bug. 
> > Error: > ``` > 20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, > ip-172-31-32-201.us-east-2.compute.internal, executor 2): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last):20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID > 6, ip-172-31-32-201.us-east-2.compute.internal, executor 2): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", > line 377, in main process() File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", > line 372, in process serializer.dump_stream(func(split_index, iterator), > outfile) File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", > line 345, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", > line 141, in dump_stream for obj in iterator: File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", > line 334, in _batched for item in iterator: File "", line 1, in > File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", > line 85, in return lambda *a: f(*a) File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/util.py", > line 113, in wrapper return f(*args, **kwargs) File "", line 3, > in f File > 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", > line 148, in value self._value = self.load_from_path(self._path) File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", > line 124, in load_from_path with open(path, 'rb', 1 << 20) as > f:MemoryError > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at >
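For context on the mechanism (a sketch, not a Spark excerpt): when spark.executor.pyspark.memory is set, the PySpark worker caps its own address space with resource.setrlimit(resource.RLIMIT_AS, ...). Any sizeable allocation made afterwards, such as the 1 MiB read buffer broadcast.py requests via open(path, 'rb', 1 << 20) or the unpickled broadcast value itself, can then fail with MemoryError. A minimal Unix-only reproduction of that failure mode outside Spark, with illustrative limit values:

```python
# Reproduce the MemoryError mechanism outside Spark: cap the address space
# in a child process (as the PySpark worker does when
# spark.executor.pyspark.memory is set), then attempt a large allocation.
# The 512 MiB / 1 GiB values below are illustrative, not Spark defaults.
import subprocess
import sys
import textwrap

child = textwrap.dedent("""
    import resource
    limit = 512 * 1024 * 1024                    # cap address space at 512 MiB
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    try:
        buf = bytearray(1024 * 1024 * 1024)      # 1 GiB, above the cap
        print("allocated")
    except MemoryError:
        print("MemoryError")
""")

result = subprocess.run([sys.executable, "-c", child],
                        capture_output=True, text=True)
print(result.stdout.strip())
```

The allocation is attempted in a subprocess so the demo cannot destabilize the parent interpreter; the same RLIMIT_AS check is what turns the worker's buffered `open` call into a MemoryError.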
[jira] [Commented] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
[ https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080257#comment-17080257 ] JinxinTang commented on SPARK-31386: Or could you please try Spark 2.4.5? No problem could be reproduced there either. [download link|https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz] > Reading broadcast in UDF raises MemoryError when > spark.executor.pyspark.memory is set > -
[jira] [Assigned] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default
[ https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-24963: - Assignee: Matthew Cheah > Integration tests will fail if they run in a namespace not being the default > > > Key: SPARK-24963 > URL: https://issues.apache.org/jira/browse/SPARK-24963 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Matthew Cheah >Priority: Minor > Fix For: 2.4.0 > > > Related discussion is here: > [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893] > If spark-rbac.yaml is used when tests are run locally, client-mode tests > will fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31401) Add JDK11 example in `bin/docker-image-tool.sh` usage
[ https://issues.apache.org/jira/browse/SPARK-31401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31401. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28171 [https://github.com/apache/spark/pull/28171] > Add JDK11 example in `bin/docker-image-tool.sh` usage > - > > Key: SPARK-31401 > URL: https://issues.apache.org/jira/browse/SPARK-31401 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > {code:java} > $ bin/docker-image-tool.sh -h > ... > - Build and push JDK11-based image with tag "v3.0.0" to docker.io/myrepo > bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 -b > java_image_tag=11-jre-slim build > bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 push {code} > > {code:java} > $ docker run -it --rm docker.io/myrepo/spark:v3.0.0 java --version | tail -n3 > openjdk 11.0.6 2020-01-14 > OpenJDK Runtime Environment 18.9 (build 11.0.6+10) > OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31404) backward compatibility issues after switching to Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31404: Summary: backward compatibility issues after switching to Proleptic Gregorian calendar (was: backward compatibility after switching to Proleptic Gregorian calendar) > backward compatibility issues after switching to Proleptic Gregorian calendar > - > > Key: SPARK-31404 > URL: https://issues.apache.org/jira/browse/SPARK-31404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > In Spark 3.0, we switched to the Proleptic Gregorian calendar by using the Java > 8 datetime APIs. This makes Spark follow the ISO and SQL standards, but > introduces some backward compatibility problems: > 1. Spark may read wrong data from data files written by Spark 2.4 > 2. there may be performance regressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
[ https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080205#comment-17080205 ] JinxinTang commented on SPARK-31386: Is it already fixed in version 2.4.4? It seems OK in both local mode and yarn mode. Could you please provide more detailed information if this bug is still alive, or try an official Spark release build and reproduce it in a local environment? > Reading broadcast in UDF raises MemoryError when > spark.executor.pyspark.memory is set > -
[jira] [Updated] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
[ https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JinxinTang updated SPARK-31386: --- Attachment: 选区_267.png > Reading broadcast in UDF raises MemoryError when > spark.executor.pyspark.memory is set > -
[jira] [Commented] (SPARK-31406) ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080203#comment-17080203 ] jiaan.geng commented on SPARK-31406: I'm working on it. > ThriftServerQueryTestSuite: Sharing test data and test tables among multiple > test cases. > > > Key: SPARK-31406 > URL: https://issues.apache.org/jira/browse/SPARK-31406 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Minor > > ThriftServerQueryTestSuite takes 17 minutes to run. > I checked the code and found that ThriftServerQueryTestSuite loads test data > repeatedly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31406) ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases.
jiaan.geng created SPARK-31406: -- Summary: ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases. Key: SPARK-31406 URL: https://issues.apache.org/jira/browse/SPARK-31406 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng ThriftServerQueryTestSuite takes 17 minutes to run. I checked the code and found that ThriftServerQueryTestSuite loads test data repeatedly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31361) Rebase datetime in parquet/avro according to file written Spark version
[ https://issues.apache.org/jira/browse/SPARK-31361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31361: Parent: SPARK-31404 Issue Type: Sub-task (was: Bug) > Rebase datetime in parquet/avro according to file written Spark version > --- > > Key: SPARK-31361 > URL: https://issues.apache.org/jira/browse/SPARK-31361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080193#comment-17080193 ] Wenchen Fan commented on SPARK-30951: - FYI I've created a ticket for the fail-by-default behavior: https://issues.apache.org/jira/browse/SPARK-31405 > Potential data loss for legacy applications after switch to proleptic > Gregorian calendar > > > Key: SPARK-30951 > URL: https://issues.apache.org/jira/browse/SPARK-30951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Assignee: Maxim Gekk >Priority: Blocker > Labels: release-notes > Fix For: 3.0.0 > > > tl;dr: We recently discovered some Spark 2.x sites that have lots of data > containing dates before October 15, 1582. This could be an issue when such > sites try to upgrade to Spark 3.0. > From SPARK-26651: > {quote}"The changes might impact on the results for dates and timestamps > before October 15, 1582 (Gregorian) > {quote} > We recently discovered that some large scale Spark 2.x applications rely on > dates before October 15, 1582. > Two cases came up recently: > * An application that uses a commercial third-party library to encode > sensitive dates. On insert, the library encodes the actual date as some other > date. On select, the library decodes the date back to the original date. The > encoded value could be any date, including one before October 15, 1582 (e.g., > "0602-04-04"). > * An application that uses a specific unlikely date (e.g., "1200-01-01") as > a marker to indicate "unknown date" (in lieu of null) > Both sites ran into problems after another component in their system was > upgraded to use the proleptic Gregorian calendar. Spark applications that > read files created by the upgraded component were interpreting encoded or > marker dates incorrectly, and vice versa. 
Also, their data now had a mix of > calendars (hybrid and proleptic Gregorian) with no metadata to indicate which > file used which calendar. > Both sites had enormous amounts of existing data, so re-encoding the dates > using some other scheme was not a feasible solution. > This is relevant to Spark 3: > Any Spark 2 application that uses such date-encoding schemes may run into > trouble when run on Spark 3. The application may not properly interpret the > dates previously written by Spark 2. Also, once the Spark 3 version of the > application writes data, the tables will have a mix of calendars (hybrid and > proleptic gregorian) with no metadata to indicate which file uses which > calendar. > Similarly, sites might run with mixed Spark versions, resulting in data > written by one version that cannot be interpreted by the other. And as above, > the tables will now have a mix of calendars with no way to detect which file > uses which calendar. > As with the two real-life example cases, these applications may have enormous > amounts of legacy data, so re-encoding the dates using some other scheme may > not be feasible. > We might want to consider a configuration setting to allow the user to > specify the calendar for storing and retrieving date and timestamp values > (not sure how such a flag would affect other date and timestamp-related > functions). I realize the change is far bigger than just adding a > configuration setting. > Here's a quick example of where trouble may happen, using the real-life case > of the marker date. > In Spark 2.4: > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 1 > scala> > {noformat} > In Spark 3.0 (reading from the same legacy file): > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 0 > scala> > {noformat} > By the way, Hive had a similar problem. 
Hive switched from hybrid calendar to > proleptic Gregorian calendar between 2.x and 3.x. After some upgrade > headaches related to dates before 1582, the Hive community made the following > changes: > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > checks a configuration setting to determine which calendar to use. > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > stores the calendar type in the metadata. > * When reading date or timestamp data from ORC, Parquet, and Avro files, > Hive checks the metadata for the calendar type. > * When reading date or timestamp data from ORC, Parquet, and Avro files that > lack calendar metadata, Hive's behavior is determined by a configuration > setting. This allows Hive to read legacy data (note: if the data already > consists of a mix of calendar types with no metadata,
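The read-side rule in the bullet points above can be sketched as a small resolution function: prefer the calendar recorded in the file's metadata, and fall back to a configuration setting for legacy files that carry no such metadata. This is a minimal illustrative sketch; the metadata key "writer.calendar" and the type names below are hypothetical, not Hive's actual ones.

```scala
// Hypothetical sketch of the Hive-style read path for calendar selection.
// The metadata key "writer.calendar" and the names below are illustrative.
sealed trait Calendar
case object Hybrid extends Calendar             // Julian + Gregorian
case object ProlepticGregorian extends Calendar

def calendarFor(fileMetadata: Map[String, String],
                legacyDefault: Calendar): Calendar =
  fileMetadata.get("writer.calendar") match {
    case Some("hybrid")    => Hybrid
    case Some("proleptic") => ProlepticGregorian
    case _                 => legacyDefault  // no metadata: treat as legacy file
  }
```

With this shape, the ambiguous case the report warns about is exactly the `legacyDefault` branch: files written before the metadata convention existed are indistinguishable, so only a configuration setting can decide.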
[jira] [Comment Edited] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080193#comment-17080193 ] Wenchen Fan edited comment on SPARK-30951 at 4/10/20, 3:40 AM: --- FYI I've created a ticket for the fail-by-default behavior: https://issues.apache.org/jira/browse/SPARK-31405 was (Author: cloud_fan): FTYI I've created a ticket for the fail-by-default behavior: https://issues.apache.org/jira/browse/SPARK-31405 > Potential data loss for legacy applications after switch to proleptic > Gregorian calendar > > > Key: SPARK-30951 > URL: https://issues.apache.org/jira/browse/SPARK-30951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Assignee: Maxim Gekk >Priority: Blocker > Labels: release-notes > Fix For: 3.0.0 > > > tl;dr: We recently discovered some Spark 2.x sites that have lots of data > containing dates before October 15, 1582. This could be an issue when such > sites try to upgrade to Spark 3.0. > From SPARK-26651: > {quote}"The changes might impact on the results for dates and timestamps > before October 15, 1582 (Gregorian) > {quote} > We recently discovered that some large scale Spark 2.x applications rely on > dates before October 15, 1582. > Two cases came up recently: > * An application that uses a commercial third-party library to encode > sensitive dates. On insert, the library encodes the actual date as some other > date. On select, the library decodes the date back to the original date. The > encoded value could be any date, including one before October 15, 1582 (e.g., > "0602-04-04"). > * An application that uses a specific unlikely date (e.g., "1200-01-01") as > a marker to indicate "unknown date" (in lieu of null) > Both sites ran into problems after another component in their system was > upgraded to use the proleptic Gregorian calendar. 
Spark applications that > read files created by the upgraded component were interpreting encoded or > marker dates incorrectly, and vice versa. Also, their data now had a mix of > calendars (hybrid and proleptic Gregorian) with no metadata to indicate which > file used which calendar. > Both sites had enormous amounts of existing data, so re-encoding the dates > using some other scheme was not a feasible solution. > This is relevant to Spark 3: > Any Spark 2 application that uses such date-encoding schemes may run into > trouble when run on Spark 3. The application may not properly interpret the > dates previously written by Spark 2. Also, once the Spark 3 version of the > application writes data, the tables will have a mix of calendars (hybrid and > proleptic gregorian) with no metadata to indicate which file uses which > calendar. > Similarly, sites might run with mixed Spark versions, resulting in data > written by one version that cannot be interpreted by the other. And as above, > the tables will now have a mix of calendars with no way to detect which file > uses which calendar. > As with the two real-life example cases, these applications may have enormous > amounts of legacy data, so re-encoding the dates using some other scheme may > not be feasible. > We might want to consider a configuration setting to allow the user to > specify the calendar for storing and retrieving date and timestamp values > (not sure how such a flag would affect other date and timestamp-related > functions). I realize the change is far bigger than just adding a > configuration setting. > Here's a quick example of where trouble may happen, using the real-life case > of the marker date. 
> In Spark 2.4: > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 1 > scala> > {noformat} > In Spark 3.0 (reading from the same legacy file): > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 0 > scala> > {noformat} > By the way, Hive had a similar problem. Hive switched from hybrid calendar to > proleptic Gregorian calendar between 2.x and 3.x. After some upgrade > headaches related to dates before 1582, the Hive community made the following > changes: > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > checks a configuration setting to determine which calendar to use. > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > stores the calendar type in the metadata. > * When reading date or timestamp data from ORC, Parquet, and Avro files, > Hive checks the metadata for the calendar type. > * When reading date or timestamp data from ORC, Parquet, and Avro files that > lack calendar
[jira] [Created] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not
Wenchen Fan created SPARK-31405: --- Summary: fail by default when read/write datetime values and not sure if they need rebase or not Key: SPARK-31405 URL: https://issues.apache.org/jira/browse/SPARK-31405 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan
[jira] [Updated] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31297: Parent Issue: SPARK-31404 (was: SPARK-30951) > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. > For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 
7170 minutes > local date-time = 1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-time = 1981-09-30T13:52:58 diff = 0 minutes > local date-time = 1982-09-30T12:52:58 diff = 60 minutes > local date-time = 1982-09-30T13:52:58 diff = 0 minutes > local date-time = 1983-09-30T12:52:58 diff = 60 minutes > local date-time = 1983-09-30T13:52:58 diff = 0 minutes > local date-time = 1984-09-29T15:52:58 diff = 60 minutes > local date-time = 1984-09-29T16:52:58 diff = 0 minutes > local date-time = 1985-09-28T15:52:58 diff = 60 minutes > local date-time = 1985-09-28T16:52:58 diff = 0 minutes > local date-time = 1986-09-27T15:52:58 diff = 60 minutes > local date-time = 1986-09-27T16:52:58 diff = 0 minutes > local date-time = 1987-09-26T15:52:58 diff = 60 minutes > local date-time = 
1987-09-26T16:52:58 diff = 0 minutes > local date-time = 1988-09-24T15:52:58 diff = 60 minutes > local date-time = 1988-09-24T16:52:58 diff = 0 minutes > local date-time = 1989-09-23T15:52:58 diff = 60 minutes > local date-time = 1989-09-23T16:52:58 diff = 0 minutes > local date-time = 1990-09-29T15:52:58 diff = 60 minutes > local date-time = 1990-09-29T16:52:58 diff = 0 minutes > local date-time = 1991-09-28T16:52:58 diff = 60 minutes > local date-time = 1991-09-28T17:52:58 diff = 0 minutes > local date-time = 1992-09-26T15:52:58 diff = 60 minutes > local date-time = 1992-09-26T16:52:58 diff = 0 minutes > local date-time = 1993-09-25T15:52:58 diff = 60 minutes > local date-time =
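Since the exploration above shows fewer than 100 switch points per zone, the proposed lookup can be a plain binary search over a precomputed table. A minimal sketch, assuming `switches` holds the sorted microsecond instants where the diff changes and `diffs(i)` is the diff in effect from `switches(i)` onward; the table values below are made up for illustration, not real America/Los_Angeles data.

```scala
// Sketch of rebasing via a precomputed switch table (illustrative values).
// switches(i) is the first micros instant at which diffs(i) applies.
object RebaseSketch {
  val switches: Array[Long] = Array(Long.MinValue, -12219292800000000L, -2208988800000000L)
  val diffs: Array[Long]    = Array(-863744000000L, 256000000L, 0L)

  // Binary-search the last switch point <= micros, then apply its diff.
  def rebase(micros: Long): Long = {
    var lo = 0
    var hi = switches.length - 1
    while (lo < hi) {
      val mid = (lo + hi + 1) >>> 1
      if (switches(mid) <= micros) lo = mid else hi = mid - 1
    }
    micros + diffs(lo)
  }
}
```

Each rebase call is then O(log n) over a tiny array instead of a full local date-time round trip, which is where the expected speed-up comes from.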
[jira] [Updated] (SPARK-31359) Speed up timestamps rebasing
[ https://issues.apache.org/jira/browse/SPARK-31359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31359: Parent: SPARK-31404 Issue Type: Sub-task (was: Improvement) > Speed up timestamps rebasing > > > Key: SPARK-31359 > URL: https://issues.apache.org/jira/browse/SPARK-31359 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, rebasing of timestamps is performed via conversions to local > timestamps and back to microseconds. This is a CPU-intensive operation which > can be avoided by converting via pre-calculated tables per time zone. For > example, the below is timestamps when diffs are changed in > America/Los_Angeles time zone for the range 0001-01-01...2100-01-01 > {code} > 0001-01-01T00:00 diff = -2872 minutes > 0100-03-01T00:00 diff = -1432 minutes > 0200-03-01T00:00 diff = 7 minutes > 0300-03-01T00:00 diff = 1447 minutes > 0500-03-01T00:00 diff = 2887 minutes > 0600-03-01T00:00 diff = 4327 minutes > 0700-03-01T00:00 diff = 5767 minutes > 0900-03-01T00:00 diff = 7207 minutes > 1000-03-01T00:00 diff = 8647 minutes > 1100-03-01T00:00 diff = 10087 minutes > 1300-03-01T00:00 diff = 11527 minutes > 1400-03-01T00:00 diff = 12967 minutes > 1500-03-01T00:00 diff = 14407 minutes > 1582-10-15T00:00 diff = 7 minutes > 1883-11-18T12:22:58 diff = 0 minutes > {code} > It seems it is possible to build rebasing maps, and perform rebasing via the > maps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31385) Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing
[ https://issues.apache.org/jira/browse/SPARK-31385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31385: Parent: SPARK-31404 Issue Type: Sub-task (was: Bug) > Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing > - > > Key: SPARK-31385 > URL: https://issues.apache.org/jira/browse/SPARK-31385 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Microseconds rebasing from the hybrid calendar (Julian + Gregorian) to > Proleptic Gregorian calendar is not symmetric to opposite conversion for the > following time zones: > # Asia/Tehran > # Iran > # Africa/Casablanca > # Africa/El_Aaiun > Here is the results from the https://github.com/apache/spark/pull/28119: > Julian -> Gregorian: > {code:json} > , { > "tz" : "Asia/Tehran", > "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, > -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, > -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, > -2208988800, 2547315000, 2547401400 ], > "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, > -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] > }, { > "tz" : "Iran", > "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, > -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, > -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, > -2208988800, 2547315000, 2547401400 ], > "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, > -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] > }, { > "tz" : "Africa/Casablanca", > "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, > -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, > -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, > -2208988800, 2141866800, 2169079200, 
2172106800, 2199924000, 2202951600, > 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, > 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, > 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, > 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, > 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, > 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, > 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, > 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, > 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, > 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, > 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, > 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, > 3332714400, 3335742000, 3362954400, 3366586800, 3393799200, 3396826800, > 3424644000, 3427671600, 3454884000, 3457911600, 3485728800, 3488756400, > 3515968800, 3519601200, 3546813600, 3549841200, 3577658400, 3580686000, > 3607898400, 3611530800, 3638743200, 3641770800, 3669588000, 3672615600, > 3699828000, 3702855600 ], > "diffs" : [ 174620, 88220, 1820, -84580, -170980, -257380, -343780, > -430180, -516580, -602980, -689380, -775780, -862180, 1820, 0, -3600, 0, > -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, > 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, > -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, > 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, > -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, > 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, > -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600 ] > }, { > "tz" : "Africa/El_Aaiun", > "switches" : [ -62135769600, -59006448000, -55850688000, 
-52694928000, > -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, > -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, > -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, > 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, > 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, > 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, > 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, > 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, > 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, > 2781136800,
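The reported asymmetry has a simple mechanical explanation: the diff is looked up by the *input* instant, and the shift itself can cross a switch point, so the reverse lookup resolves to a different diff. A toy illustration with a made-up one-switch table (not Spark's actual rebase functions):

```scala
// Toy illustration of why Julian->Gregorian and Gregorian->Julian rebasing
// need not be inverses. Values are made up: before instant 100 the diff is
// 10 micros, afterwards it is 0.
val switchPoint = 100L
def diffAt(micros: Long): Long = if (micros < switchPoint) 10L else 0L

def toGregorian(julianMicros: Long): Long = julianMicros + diffAt(julianMicros)
def toJulian(gregorianMicros: Long): Long = gregorianMicros - diffAt(gregorianMicros)

// 95 is before the switch, so the forward shift adds 10 and lands at 105,
// which is after the switch; the reverse lookup then subtracts 0, so the
// round trip returns 105 rather than the original 95.
val roundTrip = toJulian(toGregorian(95L))
```

Instants falling inside such "shadow" intervals around a switch point are exactly where the two rebase directions fail to match, which is consistent with the mismatches clustering in zones with unusual historical offsets like Asia/Tehran and Africa/Casablanca.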
[jira] [Created] (SPARK-31404) backward compatibility after switching to Proleptic Gregorian calendar
Wenchen Fan created SPARK-31404: --- Summary: backward compatibility after switching to Proleptic Gregorian calendar Key: SPARK-31404 URL: https://issues.apache.org/jira/browse/SPARK-31404 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but introduces some backward compatibility problems: 1. may read wrong data from the data files written by Spark 2.4 2. may have perf regression
[jira] [Assigned] (SPARK-31359) Speed up timestamps rebasing
[ https://issues.apache.org/jira/browse/SPARK-31359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31359: --- Assignee: Maxim Gekk > Speed up timestamps rebasing > > > Key: SPARK-31359 > URL: https://issues.apache.org/jira/browse/SPARK-31359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Currently, rebasing of timestamps is performed via conversions to local > timestamps and back to microseconds. This is a CPU-intensive operation which > can be avoided by converting via pre-calculated tables per time zone. For > example, the below is timestamps when diffs are changed in > America/Los_Angeles time zone for the range 0001-01-01...2100-01-01 > {code} > 0001-01-01T00:00 diff = -2872 minutes > 0100-03-01T00:00 diff = -1432 minutes > 0200-03-01T00:00 diff = 7 minutes > 0300-03-01T00:00 diff = 1447 minutes > 0500-03-01T00:00 diff = 2887 minutes > 0600-03-01T00:00 diff = 4327 minutes > 0700-03-01T00:00 diff = 5767 minutes > 0900-03-01T00:00 diff = 7207 minutes > 1000-03-01T00:00 diff = 8647 minutes > 1100-03-01T00:00 diff = 10087 minutes > 1300-03-01T00:00 diff = 11527 minutes > 1400-03-01T00:00 diff = 12967 minutes > 1500-03-01T00:00 diff = 14407 minutes > 1582-10-15T00:00 diff = 7 minutes > 1883-11-18T12:22:58 diff = 0 minutes > {code} > It seems it is possible to build rebasing maps, and perform rebasing via the > maps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31359) Speed up timestamps rebasing
[ https://issues.apache.org/jira/browse/SPARK-31359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31359. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28163 [https://github.com/apache/spark/pull/28163] > Speed up timestamps rebasing > > > Key: SPARK-31359 > URL: https://issues.apache.org/jira/browse/SPARK-31359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, rebasing of timestamps is performed via conversions to local > timestamps and back to microseconds. This is a CPU-intensive operation which > can be avoided by converting via pre-calculated tables per time zone. For > example, the below is timestamps when diffs are changed in > America/Los_Angeles time zone for the range 0001-01-01...2100-01-01 > {code} > 0001-01-01T00:00 diff = -2872 minutes > 0100-03-01T00:00 diff = -1432 minutes > 0200-03-01T00:00 diff = 7 minutes > 0300-03-01T00:00 diff = 1447 minutes > 0500-03-01T00:00 diff = 2887 minutes > 0600-03-01T00:00 diff = 4327 minutes > 0700-03-01T00:00 diff = 5767 minutes > 0900-03-01T00:00 diff = 7207 minutes > 1000-03-01T00:00 diff = 8647 minutes > 1100-03-01T00:00 diff = 10087 minutes > 1300-03-01T00:00 diff = 11527 minutes > 1400-03-01T00:00 diff = 12967 minutes > 1500-03-01T00:00 diff = 14407 minutes > 1582-10-15T00:00 diff = 7 minutes > 1883-11-18T12:22:58 diff = 0 minutes > {code} > It seems it is possible to build rebasing maps, and perform rebasing via the > maps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30808) Thrift server returns wrong timestamps/dates strings before 1582
[ https://issues.apache.org/jira/browse/SPARK-30808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-30808. - Resolution: Cannot Reproduce > Thrift server returns wrong timestamps/dates strings before 1582 > > > Key: SPARK-30808 > URL: https://issues.apache.org/jira/browse/SPARK-30808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Set the environment variable: > {code:bash} > export TZ="America/Los_Angeles" > ./bin/spark-sql -S > {code} > {code:sql} > spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; > spark.sql.session.timeZoneAmerica/Los_Angeles > spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); > 1001-01-01 00:07:02 > {code} > The expected result must be *1001-01-01 00:00:00*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30808) Thrift server returns wrong timestamps/dates strings before 1582
[ https://issues.apache.org/jira/browse/SPARK-30808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080183#comment-17080183 ] Xiao Li commented on SPARK-30808: - The problem does not exist. > Thrift server returns wrong timestamps/dates strings before 1582 > > > Key: SPARK-30808 > URL: https://issues.apache.org/jira/browse/SPARK-30808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Set the environment variable: > {code:bash} > export TZ="America/Los_Angeles" > ./bin/spark-sql -S > {code} > {code:sql} > spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; > spark.sql.session.timeZoneAmerica/Los_Angeles > spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); > 1001-01-01 00:07:02 > {code} > The expected result must be *1001-01-01 00:00:00*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30808) Thrift server returns wrong timestamps/dates strings before 1582
[ https://issues.apache.org/jira/browse/SPARK-30808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-30808: - > Thrift server returns wrong timestamps/dates strings before 1582 > > > Key: SPARK-30808 > URL: https://issues.apache.org/jira/browse/SPARK-30808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Set the environment variable: > {code:bash} > export TZ="America/Los_Angeles" > ./bin/spark-sql -S > {code} > {code:sql} > spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; > spark.sql.session.timeZoneAmerica/Los_Angeles > spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); > 1001-01-01 00:07:02 > {code} > The expected result must be *1001-01-01 00:00:00*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
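The 00:07:02 residue in the repro above is consistent with the zone's pre-standard-time offset: before 1883-11-18, America/Los_Angeles used local mean time, UTC-07:52:58, which differs from UTC-08:00 by exactly 7 minutes 2 seconds. A quick java.time check of that offset (a sketch of the calendar arithmetic involved, not the Spark code path):

```scala
import java.time.{LocalDateTime, ZoneId}

// For dates long before 1883, the tz database gives America/Los_Angeles
// its local-mean-time offset of UTC-07:52:58. The 7m02s gap to UTC-08:00
// matches the spurious 00:07:02 in the DATE_TRUNC result above.
val zone      = ZoneId.of("America/Los_Angeles")
val lmtOffset = zone.getRules.getOffset(LocalDateTime.of(1001, 1, 1, 0, 0))
val gapToStandardSeconds = lmtOffset.getTotalSeconds - (-8 * 3600)
```

This makes the bug a plausible interaction between millennium truncation and pre-1883 zone offsets rather than a rounding error, which may explain why it stopped reproducing once the truncation path was fixed.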
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080175#comment-17080175 ] zhengruifeng commented on SPARK-31301: -- (One small question: the new method takes a Dataset[_] rather than DataFrame; should it be DataFrame too?) I think so, we may need to change it to DataFrame; what is different about what they do then? I think existing two methods provide similar function, since for the first method, what we can do is just collecting it back to the driver, so I think it is similar to the second one. If we add another method (or change the return type of second method), then we can do more operations on it. Another point is that: If we return dataframe containing one row per feature, then we can avoid the bottleneck on the driver, because we no longer need to collect test results of all features back to the driver. I will test it and maybe send a PR to show it. > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> 
>val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 1 more field]scala> chi.show > +++--+ > | pValues|degreesOfFreedom|statistics| > +++--+ > |[0.68728927879097...| [2, 3]|[0.75,1.5]| > +++--+{code} > > Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, > {{Correlation}} all return a df only containing one row. > I think this is quite hard to use; suppose we have a dataset with dim=1000, > the only operation we can perform on the test result is to collect it by > {{head()}} or {{first()}}, and then use it in the driver. > While what I really want to do is filtering the df like {{pValue > 0.1}} or > {{corr < 0.5}}, *so I suggest to flatten the output df in those tests.* > > note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but > ChiSquareTest and Correlation were here for a long time. > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
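The proposed flattening can be sketched without Spark by modeling the result as one record per feature; with that shape, a per-feature predicate becomes an ordinary row filter instead of a vector lookup on a single collected row. The case class and values below are illustrative, not the actual ML API.

```scala
// Hypothetical "one row per feature" layout for stat test results.
case class FeatureTestResult(featureIndex: Int,
                             pValue: Double,
                             degreesOfFreedom: Long,
                             statistic: Double)

// Flatten the per-column vectors of a single-row result into per-feature rows.
def flatten(pValues: Array[Double],
            dof: Array[Long],
            stats: Array[Double]): Seq[FeatureTestResult] =
  pValues.indices.map(i => FeatureTestResult(i, pValues(i), dof(i), stats(i)))

// Values mirror the shape (not the exact digits) of the chi.show output above.
val rows        = flatten(Array(0.687, 0.682), Array(2L, 3L), Array(0.75, 1.5))
val interesting = rows.filter(_.pValue > 0.1)   // per-feature filter, no collect
```

As the comment thread notes, keeping the result as per-feature rows would also let the filter run distributed, avoiding the driver bottleneck of collecting one wide row for high-dimensional data.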
[jira] [Resolved] (SPARK-31355) Document TABLESAMPLE in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31355. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28130 [https://github.com/apache/spark/pull/28130] > Document TABLESAMPLE in SQL Reference > - > > Key: SPARK-31355 > URL: https://issues.apache.org/jira/browse/SPARK-31355 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Document TABLESAMPLE in SQL Reference -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31355) Document TABLESAMPLE in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31355: Assignee: Huaxin Gao > Document TABLESAMPLE in SQL Reference > - > > Key: SPARK-31355 > URL: https://issues.apache.org/jira/browse/SPARK-31355 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Document TABLESAMPLE in SQL Reference -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30819) Add FMRegressor wrapper to SparkR
[ https://issues.apache.org/jira/browse/SPARK-30819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30819. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27571 [https://github.com/apache/spark/pull/27571] > Add FMRegressor wrapper to SparkR > - > > Key: SPARK-30819 > URL: https://issues.apache.org/jira/browse/SPARK-30819 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Spark should provide a wrapper for {{o.a.s.ml.regression. FMRegressor}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30819) Add FMRegressor wrapper to SparkR
[ https://issues.apache.org/jira/browse/SPARK-30819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30819: Assignee: Maciej Szymkiewicz > Add FMRegressor wrapper to SparkR > - > > Key: SPARK-30819 > URL: https://issues.apache.org/jira/browse/SPARK-30819 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > Spark should provide a wrapper for {{o.a.s.ml.regression. FMRegressor}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30808) Thrift server returns wrong timestamps/dates strings before 1582
[ https://issues.apache.org/jira/browse/SPARK-30808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080102#comment-17080102 ] Xiao Li commented on SPARK-30808: - Is it reverted? > Thrift server returns wrong timestamps/dates strings before 1582 > > > Key: SPARK-30808 > URL: https://issues.apache.org/jira/browse/SPARK-30808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Set the environment variable: > {code:bash} > export TZ="America/Los_Angeles" > ./bin/spark-sql -S > {code} > {code:sql} > spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; > spark.sql.session.timeZoneAmerica/Los_Angeles > spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); > 1001-01-01 00:07:02 > {code} > The expected result must be *1001-01-01 00:00:00*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31403) TreeNode asCode function incorrectly handles null literals
Carl Sverre created SPARK-31403: --- Summary: TreeNode asCode function incorrectly handles null literals Key: SPARK-31403 URL: https://issues.apache.org/jira/browse/SPARK-31403 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Carl Sverre In the TreeNode code in Catalyst the asCode function incorrectly handles null literals. When it tries to render a null literal it will match {{null}} using the third case expression and try to call {{null.toString}} which will raise a NullPointerException. I verified this bug exists in Spark 2.4.4 and the same code appears to be in master: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L707] The fix seems trivial - add an explicit case for null. One way to reproduce this is via: {code:java} val plan = spark .sql("select if(isnull(id), null, 2) from testdb_jdbc.users") .queryExecution .optimizedPlan println(plan.asInstanceOf[Project].projectList.head.asCode) {code} However any other way which generates a Literal with the value null will cause the issue. In this case the above SparkSQL will generate the literal: {{Literal(null, IntegerType)}} for the "trueValue" of the if statement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
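The missing-null-case pattern is easy to see outside Catalyst. Below is a Python analogue, illustrative only: `render_literal` and its output format are invented for this sketch and are not Spark's `asCode`. The point it demonstrates is the proposed fix, that the explicit null branch must come before any `toString`-style rendering of the value.

```python
def render_literal(value, data_type="IntegerType"):
    # The explicit null case must come first: in the Scala original,
    # falling through to a `value.toString`-style call on a null
    # literal is what raises the NullPointerException.
    if value is None:
        return f"Literal(null, {data_type})"
    return f"Literal({value}, {data_type})"
```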
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31399: -- Priority: Blocker (was: Major)
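Stepping back, the cleaner's job in SPARK-31399's failure mode is to notice that a function captured a non-serializable driver-side object (`col`, a `Column`) and surface that clearly. Python closures allow a rough analogue of that capture-inspection idea; this is illustrative only and not Spark's implementation, with a `threading.Lock` standing in for the unpicklable captured object.

```python
import pickle
import threading

def make_task():
    col = threading.Lock()  # stand-in for a non-serializable driver-side object
    def task(x):
        return (x, col)  # `col` is captured from the enclosing scope
    return task

def unserializable_captures(fn):
    """Return the names of captured values that fail to serialize,
    roughly what ensureSerializable reports via its serialization stack."""
    bad = []
    for name, cell in zip(fn.__code__.co_freevars, fn.__closure__ or ()):
        try:
            pickle.dumps(cell.cell_contents)
        except Exception:
            bad.append(name)
    return bad

task = make_task()
bad = unserializable_captures(task)
```

The Scala 2.12 difficulty described in the ticket is that the capture now lives inside a `java.lang.invoke.SerializedLambda` (visible in the stack trace as `capturedArgs`) rather than in fields of a `$anonfun$`-named class, so the old name-based `isClosure` check never fires.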
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31399: -- Target Version/s: 3.0.0
[jira] [Commented] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079992#comment-17079992 ] Dongjoon Hyun commented on SPARK-31399: --- Thank you for pinging me, [~cloud_fan]. cc [~dbtsai] and [~holden]
[jira] [Created] (SPARK-31402) Incorrect rebasing of BCE dates
Maxim Gekk created SPARK-31402: -- Summary: Incorrect rebasing of BCE dates Key: SPARK-31402 URL: https://issues.apache.org/jira/browse/SPARK-31402 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Dates before the common era are rebased incorrectly; see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120679/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/ {code} sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: postgreSQL/date.sql Expected "[-0044]-03-15", but got "[0045]-03-15" Result did not match for query #93 select make_date(-44, 3, 15) {code} Even though such dates are outside the valid range of dates supported by the DATE type, there is a test in postgreSQL/date.sql for a negative year, and it would be nice to fix the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
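For context, the rebasing being tested converts dates between the legacy hybrid Julian calendar and the proleptic Gregorian calendar adopted in 3.0. For dates around 44 BCE the two calendars label the same physical day with nominal dates a couple of days apart, so a correct rebase shifts the day field by a small offset; it never flips the year's sign as in the failure above. A standalone sketch using the standard integer Julian-day-number formulas (the usual Fliegel/Van Flandern-style arithmetic, not Spark code; astronomical year numbering, in which 44 BCE is year -43):

```python
def jdn_julian(year, month, day):
    # Julian day number of a Julian-calendar date (integer arithmetic).
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

def jdn_gregorian(year, month, day):
    # Julian day number of a (proleptic) Gregorian-calendar date.
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045

# The Ides of March, 44 BCE (astronomical year -43).
ides_julian = jdn_julian(-43, 3, 15)
# The same nominal date names physical days two apart in the two
# calendars, so rebasing must shift days, not the year's sign.
shift = jdn_gregorian(-43, 3, 15) - ides_julian
```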
[jira] [Assigned] (SPARK-31401) Add JDK11 example in `bin/docker-image-tool.sh` usage
[ https://issues.apache.org/jira/browse/SPARK-31401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31401: - Assignee: Dongjoon Hyun > Add JDK11 example in `bin/docker-image-tool.sh` usage > - > > Key: SPARK-31401 > URL: https://issues.apache.org/jira/browse/SPARK-31401 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > {code:java} > $ bin/docker-image-tool.sh -h > ... > - Build and push JDK11-based image with tag "v3.0.0" to docker.io/myrepo > bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 -b > java_image_tag=11-jre-slim build > bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 push {code} > > {code:java} > $ docker run -it --rm docker.io/myrepo/spark:v3.0.0 java --version | tail -n3 > openjdk 11.0.6 2020-01-14 > OpenJDK Runtime Environment 18.9 (build 11.0.6+10) > OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31401) Add JDK11 example in `bin/docker-image-tool.sh` usage
Dongjoon Hyun created SPARK-31401: - Summary: Add JDK11 example in `bin/docker-image-tool.sh` usage Key: SPARK-31401 URL: https://issues.apache.org/jira/browse/SPARK-31401 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 3.0.0 Reporter: Dongjoon Hyun {code:java} $ bin/docker-image-tool.sh -h ... - Build and push JDK11-based image with tag "v3.0.0" to docker.io/myrepo bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 -b java_image_tag=11-jre-slim build bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 push {code} {code:java} $ docker run -it --rm docker.io/myrepo/spark:v3.0.0 java --version | tail -n3 openjdk 11.0.6 2020-01-14 OpenJDK Runtime Environment 18.9 (build 11.0.6+10) OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-31377: -- Description: For some combinations of join algorithm and join types there are no unit tests for the "number of output rows" metric. A list of missing unit tests includes the following. * SortMergeJoin: ExistenceJoin * ShuffledHashJoin: OuterJoin, leftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin * BroadcastHashJoin: LeftAnti, ExistenceJoin was: For some combinations of join algorithm and join types there are no unit tests for the "number of output rows" metric. A list of missing unit tests include the following. * SortMergeJoin: ExistenceJoin * ShuffledHashJoin: OuterJoin, ReftOuter, RightOuter, LeftAnti, LeftSemi, ExistenseJoin * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin * BroadcastHashJoin: LeftAnti, ExistenceJoin > Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite > -- > > Key: SPARK-31377 > URL: https://issues.apache.org/jira/browse/SPARK-31377 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Minor > > For some combinations of join algorithm and join types there are no unit > tests for the "number of output rows" metric. > A list of missing unit tests includes the following. > * SortMergeJoin: ExistenceJoin > * ShuffledHashJoin: OuterJoin, leftOuter, RightOuter, LeftAnti, LeftSemi, > ExistenceJoin > * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin > * BroadcastHashJoin: LeftAnti, ExistenceJoin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31400) The catalogString doesn't distinguish Vectors in ml and mllib
Junpei Zhou created SPARK-31400: --- Summary: The catalogString doesn't distinguish Vectors in ml and mllib Key: SPARK-31400 URL: https://issues.apache.org/jira/browse/SPARK-31400 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.4.5 Environment: Ubuntu 16.04 Reporter: Junpei Zhou h2. Bug Description The `catalogString` is not detailed enough to distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. h2. How to reproduce the bug [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an example from the official document (Python code). If I keep all other lines untouched, and only modify the Vectors import line, which means: {code:java} # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors {code} Or you can directly execute the following code snippet: {code:java} from pyspark.ml.feature import MinMaxScaler # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors dataFrame = spark.createDataFrame([ (0, Vectors.dense([1.0, 0.1, -1.0]),), (1, Vectors.dense([2.0, 1.1, 1.0]),), (2, Vectors.dense([3.0, 10.1, 3.0]),) ], ["id", "features"]) scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures") scalerModel = scaler.fit(dataFrame) {code} It will raise an error: {code:java} IllegalArgumentException: 'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' {code} However, the actual struct and the desired struct are exactly the same string, which cannot provide useful information to the programmer. I would suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31400) The catalogString doesn't distinguish Vectors in ml and mllib
[ https://issues.apache.org/jira/browse/SPARK-31400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junpei Zhou updated SPARK-31400: Description: h2. Bug Description The `catalogString` is not detailed enough to distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. h2. How to reproduce the bug [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an example from the official document (Python code). If I keep all other lines untouched, and only modify the Vectors import line, which means: {code:java} # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors {code} Or you can directly execute the following code snippet: {code:java} from pyspark.ml.feature import MinMaxScaler # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors dataFrame = spark.createDataFrame([ (0, Vectors.dense([1.0, 0.1, -1.0]),), (1, Vectors.dense([2.0, 1.1, 1.0]),), (2, Vectors.dense([3.0, 10.1, 3.0]),) ], ["id", "features"]) scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures") scalerModel = scaler.fit(dataFrame) {code} It will raise an error: {code:java} IllegalArgumentException: 'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' {code} However, the actual struct and the desired struct are exactly the same string, which cannot provide useful information to the programmer. I would suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. Thanks! was: h2. Bug Description The `catalogString` is not detailed enough to distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. h2. How to reproduce the bug [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an example from the official document (Python code). If I keep all other lines untouched, and only modify the Vectors import line, which means: {code:java} # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors {code} Or you can directly execute the following code snippet: {code:java} from pyspark.ml.feature import MinMaxScaler # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors dataFrame = spark.createDataFrame([ (0, Vectors.dense([1.0, 0.1, -1.0]),), (1, Vectors.dense([2.0, 1.1, 1.0]),), (2, Vectors.dense([3.0, 10.1, 3.0]),) ], ["id", "features"]) scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures") scalerModel = scaler.fit(dataFrame) {code} It will raise an error: {code:java} IllegalArgumentException: 'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' {code} However, the actual struct and the desired struct are exactly the same string, which cannot provide useful information to the programmer. I would suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. Thanks! > The catalogString doesn't distinguish Vectors in ml and mllib > - > > Key: SPARK-31400 > URL: https://issues.apache.org/jira/browse/SPARK-31400 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.5 > Environment: Ubuntu 16.04 >Reporter: Junpei Zhou >Priority: Major > > h2. Bug Description > The `catalogString` is not detailed enough to distinguish > pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. > h2. How to reproduce the bug > [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an > example from the official document (Python code). If I keep all other lines > untouched, and only modify the Vectors import line, it will raise an error: > {code:java} > IllegalArgumentException: 'requirement failed: Column features must be of > type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> > but was actually > struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' > {code} > However, the actual struct and the desired struct are exactly the same > string, which cannot provide useful information to the programmer. I would > suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. Thanks!
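The core complaint is that two distinct types print identical names, and it can be reproduced without Spark. Below is a hedged Python sketch of the suggested fix; the classes here are stand-ins built with `type()`, not the real pyspark UDTs. Including the module in the printed name is what makes the two `VectorUDT`s distinguishable.

```python
# Two stand-in UDT classes with identical short names, mimicking the
# VectorUDT in pyspark.ml.linalg vs. pyspark.mllib.linalg
# (the modules are simulated here, not imported).
MlVectorUDT = type("VectorUDT", (), {"__module__": "pyspark.ml.linalg"})
MllibVectorUDT = type("VectorUDT", (), {"__module__": "pyspark.mllib.linalg"})

def short_name(cls):
    # What an unqualified error message effectively prints:
    # indistinguishable for the two classes.
    return cls.__name__

def qualified_name(cls):
    # Including the module disambiguates the two types.
    return f"{cls.__module__}.{cls.__name__}"
```

The analogous change in Spark would be for the error path (or `catalogString` itself) to emit a package-qualified type name whenever two candidate types share a short name.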
[jira] [Resolved] (SPARK-31331) Document Spark integration with Hive UDFs/UDAFs/UDTFs
[ https://issues.apache.org/jira/browse/SPARK-31331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31331. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28104 [https://github.com/apache/spark/pull/28104] > Document Spark integration with Hive UDFs/UDAFs/UDTFs > - > > Key: SPARK-31331 > URL: https://issues.apache.org/jira/browse/SPARK-31331 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Document Spark integration with Hive UDFs/UDAFs/UDTFs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31331) Document Spark integration with Hive UDFs/UDAFs/UDTFs
[ https://issues.apache.org/jira/browse/SPARK-31331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31331: Assignee: Huaxin Gao > Document Spark integration with Hive UDFs/UDAFs/UDTFs > - > > Key: SPARK-31331 > URL: https://issues.apache.org/jira/browse/SPARK-31331 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Document Spark integration with Hive UDFs/UDAFs/UDTFs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-31389: -- Description: Many tests in SQLMetricsSuite run only with codegen turned off. Some complex code paths (for example, generated code in "SortMergeJoin metrics") aren't exercised at all. The generated code should be tested as well. *List of tests that run with codegen off* Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics, ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics, SortMergeJoin(left-anti) metrics was: Many tests in SQLMetricsSuite run only with codegen turned off. Some complex code paths (for example, generated code in "SortMergeJoin metrics") aren't exercised at all. The generated code should be tested as well. *List of tests that run with codegen off* Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics, ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics > Ensure all tests in SQLMetricsSuite run with both codegen on and off > > > Key: SPARK-31389 > URL: https://issues.apache.org/jira/browse/SPARK-31389 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Minor > > Many tests in SQLMetricsSuite run only with codegen turned off. Some complex > code paths (for example, generated code in "SortMergeJoin metrics") aren't > exercised at all. The generated code should be tested as well. 
> *List of tests that run with codegen off* > Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, > BroadcastHashJoin metrics, ShuffledHashJoin metrics, > BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, > BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics, > SortMergeJoin(left-anti) metrics
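The fix the ticket asks for is essentially to parameterize each metrics test over the codegen flag. A hedged Java sketch of that shape (`runForBothModes` and the `Consumer` body are illustrative stand-ins for a `SQLMetricsSuite` test body; `spark.sql.codegen.wholeStage` is Spark's whole-stage codegen switch):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class BothCodegenModes {
    // Run the same assertions once per codegen setting instead of only once.
    // runMetricsCheck stands in for the body of a metrics test.
    static void runForBothModes(Consumer<Map<String, String>> runMetricsCheck) {
        for (boolean codegen : new boolean[]{false, true}) {
            Map<String, String> conf = new HashMap<>();
            conf.put("spark.sql.codegen.wholeStage", String.valueOf(codegen));
            runMetricsCheck.accept(conf);
        }
    }

    public static void main(String[] args) {
        List<String> runs = new ArrayList<>();
        runForBothModes(conf -> runs.add(conf.get("spark.sql.codegen.wholeStage")));
        System.out.println(runs); // the check ran under both settings
    }
}
```

In Spark's own test suites this is normally done with the `withSQLConf` helper; the loop above only illustrates the two-pass structure.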
[jira] [Resolved] (SPARK-31021) Support MariaDB Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-31021. Fix Version/s: 3.1.0 Assignee: Gabor Somogyi Resolution: Fixed > Support MariaDB Kerberos login in JDBC connector > > > Key: SPARK-31021 > URL: https://issues.apache.org/jira/browse/SPARK-31021 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > >
[jira] [Commented] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079514#comment-17079514 ] Xiao Li commented on SPARK-31399: - cc [~zsxwing] > closure cleaner is broken in Spark 3.0 > -- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The `ClosureCleaner` only supports Scala functions, and it uses the following > check to catch closures: > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more, since we upgraded to Scala 2.12 and most Scala > functions became Java lambdas. > As an example, the following code works well in the Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But it fails in 3.0: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code}
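The failure mode of the quoted check can be reproduced with plain JVM reflection, outside Spark. The following Java sketch is illustrative only (the method names `isClosureOld` and `isClosureNew` are made up, and this is not Spark's actual fix): the 2.x-era name test misses LambdaMetafactory-generated classes, which are synthetic and carry "$$Lambda" in their names.

```java
import java.util.function.Function;

public class ClosureCheck {
    // The check quoted above: matches only Scala <= 2.11-style anonymous
    // function classes, whose names contain "$anonfun$".
    static boolean isClosureOld(Class<?> cls) {
        return cls.getName().contains("$anonfun$");
    }

    // Hypothetical broadened check: LambdaMetafactory-generated lambda
    // classes are synthetic and their runtime names contain "$$Lambda",
    // so accept those as closures too.
    static boolean isClosureNew(Class<?> cls) {
        return cls.getName().contains("$anonfun$")
                || (cls.isSynthetic() && cls.getName().contains("$$Lambda"));
    }

    public static void main(String[] args) {
        Function<Integer, Integer> f = x -> x + 1; // a Java lambda, like Scala 2.12 functions
        System.out.println(isClosureOld(f.getClass())); // the old name test misses it
        System.out.println(isClosureNew(f.getClass()));
    }
}
```

Scala 2.12 compiles most function literals the same way `javac` compiles `x -> x + 1`, which is why a name-based check written for 2.11 stops firing.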
[jira] [Commented] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079454#comment-17079454 ] Wenchen Fan commented on SPARK-31399: - cc [~rxin] [~sowen] [~dongjoon] [~tgraves] > closure cleaner is broken in Spark 3.0 > -- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > [...]
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31399: Description: The `ClosureCleaner` only support Scala functions and it uses the following check to catch closures {code} // Check whether a class represents a Scala closure private def isClosure(cls: Class[_]): Boolean = { cls.getName.contains("$anonfun$") } {code} This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala functions become Java lambdas. As an example, the following code works well in Spark 2.4 Spark Shell: {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. import org.apache.spark.sql.functions.lit defined class Foo col: org.apache.spark.sql.Column = 123 df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 {code} But fails in 3.0 {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) at org.apache.spark.rdd.RDD.map(RDD.scala:421) ... 
39 elided Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column Serialization stack: - object not serializable (class: org.apache.spark.sql.Column, value: 123) - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) - object (class $iw, $iw@2d87ac2b) - element of array (index: 0) - array (class [Ljava.lang.Object;, size 1) - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;) - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class $iw, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) - writeReplace data (class: java.lang.invoke.SerializedLambda) - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) ... 47 more {code} was: The `ClosureCleaner` only support Scala functions and it uses the following check to catch closures {code} // Check whether a class represents a Scala closure private def isClosure(cls: Class[_]): Boolean = { cls.getName.contains("$anonfun$") } {code} This doesn't work in 3.0 anyway as we upgrade to Scala 2.12 and most Scala functions become Java lambdas. As an example, the following code works well in Spark 2.4 Spark Shell: {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. 
import org.apache.spark.sql.functions.lit defined class Foo col: org.apache.spark.sql.Column = 123 df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 {code} But fails in 3.0 {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) at
[jira] [Created] (SPARK-31399) closure cleaner is broken in Spark 3.0
Wenchen Fan created SPARK-31399: --- Summary: closure cleaner is broken in Spark 3.0 Key: SPARK-31399 URL: https://issues.apache.org/jira/browse/SPARK-31399 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Wenchen Fan The `ClosureCleaner` only support Scala functions and it uses the following check to catch closures {code} // Check whether a class represents a Scala closure private def isClosure(cls: Class[_]): Boolean = { cls.getName.contains("$anonfun$") } {code} This doesn't work in 3.0 anyway as we upgrade to Scala 2.12 and most Scala functions become Java lambdas. As an example, the following code works well in Spark 2.4 Spark Shell: {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. import org.apache.spark.sql.functions.lit defined class Foo col: org.apache.spark.sql.Column = 123 df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 {code} But fails in 3.0 {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. 
org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) at org.apache.spark.rdd.RDD.map(RDD.scala:421) ... 39 elided Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column Serialization stack: - object not serializable (class: org.apache.spark.sql.Column, value: 123) - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) - object (class $iw, $iw@2d87ac2b) - element of array (index: 0) - array (class [Ljava.lang.Object;, size 1) - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;) - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class $iw, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) - writeReplace data (class: java.lang.invoke.SerializedLambda) - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) ... 
47 more {code}
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079371#comment-17079371 ] Sean R. Owen commented on SPARK-31301: -- (One small question: the new method takes a Dataset[_] rather than DataFrame; should it be DataFrame too?) The new method's purpose seemed to be more about providing the result as objects rather than a DataFrame. The proposal would require making the new method also return a DataFrame, and then we have two very similar methods; what is different about what they do then, for the caller looking at the API? > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> >val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 
1 more field]scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
> Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, {{Correlation}} all return a df containing only one row. > This is quite hard to use: suppose we have a dataset with dim=1000, the only operation we can do with the test result is to collect it via {{head()}} or {{first()}} and then use it in the driver. > While what I really want to do is filter the df, like {{pValue > 0.1}} or {{corr < 0.5}}, *so I suggest to flatten the output df in those tests.* > Note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
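A flattened layout makes the filtering the reporter asks for trivial. The Java sketch below is only an illustration of the data-shape change (all names are made up; the first pValue is taken from the example above, the second is invented): one record per feature instead of a single row of vectors.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FlattenStats {
    // One entry per feature, instead of parallel vectors in a single row.
    static class FeatureStat {
        final int featureIndex;
        final double pValue;
        final double statistic;
        FeatureStat(int i, double p, double s) {
            featureIndex = i; pValue = p; statistic = s;
        }
    }

    // Turn the single row of vectors (modeled as arrays) into a table-like
    // list that can be filtered directly.
    static List<FeatureStat> flatten(double[] pValues, double[] statistics) {
        return IntStream.range(0, pValues.length)
                .mapToObj(i -> new FeatureStat(i, pValues[i], statistics[i]))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        double[] pValues = {0.687, 0.05};   // second value is made up
        double[] stats = {0.75, 1.5};
        // The filtering requested in the issue, e.g. pValue > 0.1:
        long kept = flatten(pValues, stats).stream()
                .filter(s -> s.pValue > 0.1)
                .count();
        System.out.println(kept); // number of features passing the filter
    }
}
```

With the current one-row API the equivalent operation requires collecting the whole vector to the driver first, which is the pain point of the issue.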
[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079337#comment-17079337 ] Everett Rush commented on SPARK-27249: -- Hi Nick, I think the solution could make use of the mapPartitions method, but still there should be a multicolumn transformer class. The unary transformer is very limiting. > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.1.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png > > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF, such as a database connection. > > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row, in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > the output DataType and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation.
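The mapPartitions idea behind the proposed `Iterator[Row] => Iterator[Row]` design can be sketched in plain Java (all names here are hypothetical, and `String` stands in for `Row`): the expensive object is created once per partition iterator, not once per element as in a UDF.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

public class PartitionTransformDemo {
    // Stand-in for something expensive to create, e.g. a database connection.
    static class ExpensiveResource {
        static int constructions = 0;
        ExpensiveResource() { constructions++; }
        String lookup(String key) { return key.toUpperCase(Locale.ROOT); }
    }

    // Whole-partition transform: the resource is initialized once for the
    // whole iterator, then reused for every element. (Eager for brevity;
    // a real implementation would stay lazy.)
    static Iterator<String> transformPartition(Iterator<String> rows) {
        ExpensiveResource res = new ExpensiveResource(); // once per partition
        List<String> out = new ArrayList<>();
        rows.forEachRemaining(r -> out.add(res.lookup(r)));
        return out.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> result = transformPartition(List.of("a", "b", "c").iterator());
        result.forEachRemaining(System.out::println);
        System.out.println("resources created: " + ExpensiveResource.constructions);
    }
}
```

This is the amortization `UnaryTransformer` cannot express, since its transform function sees one value at a time.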
[jira] [Created] (SPARK-31398) Speed up reading dates in ORC
Maxim Gekk created SPARK-31398: -- Summary: Speed up reading dates in ORC Key: SPARK-31398 URL: https://issues.apache.org/jira/browse/SPARK-31398 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the ORC datasource converts values of DATE type to java.sql.Date, and then converts the result to days since the epoch in the Proleptic Gregorian calendar. The ORC datasource performs this conversion when spark.sql.orc.enableVectorizedReader is set to false. The conversion to java.sql.Date is unnecessary because we can use DaysWritable, which performs the rebasing in a much more optimal way.
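The two conversion paths the issue contrasts can be illustrated in plain Java (this is not Spark's DaysWritable, just the arithmetic): `java.time.LocalDate` already counts Proleptic Gregorian days since 1970-01-01, so the intermediate `java.sql.Date` object is avoidable.

```java
import java.sql.Date;
import java.time.LocalDate;

public class EpochDays {
    public static void main(String[] args) {
        // The slow path sketched in the issue: materialize a java.sql.Date,
        // then derive days since the epoch from it.
        Date sqlDate = Date.valueOf("2020-04-09");
        long viaSqlDate = sqlDate.toLocalDate().toEpochDay();

        // The direct path: Proleptic Gregorian days since 1970-01-01,
        // with no java.sql.Date allocation at all.
        long direct = LocalDate.of(2020, 4, 9).toEpochDay();

        System.out.println(viaSqlDate + " " + direct); // both are 18361
    }
}
```

For modern dates the two paths agree; the rebasing DaysWritable handles matters for dates before the Julian/Gregorian switchover, where `java.sql.Date`'s hybrid calendar diverges from the Proleptic Gregorian one.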
[jira] [Resolved] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31397. -- Resolution: Won't Fix > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2
[jira] [Resolved] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31396. -- Resolution: Won't Fix > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs obtained by aggregating the > key-value pairs of a set of objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite
[jira] [Commented] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079268#comment-17079268 ] Rakesh Raushan commented on SPARK-31397: The same functionality can be achieved using to/from_json, and performance will be almost equivalent. So we do not need to implement this new function. > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2
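What a `json_arrayAgg`-style aggregate produces can be shown in a few lines of Java. This is a naive illustration with hypothetical names (no escaping, not a Spark or SQL-standard implementation): collapse a column of values into one JSON array string.

```java
import java.util.List;
import java.util.stream.Collectors;

public class JsonArrayAgg {
    // Aggregate a "column" of string values into a single JSON array.
    static String jsonArrayAgg(List<String> column) {
        return column.stream()
                .map(v -> "\"" + v + "\"")   // naive quoting; real impls escape
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(jsonArrayAgg(List.of("a", "b", "c"))); // ["a","b","c"]
    }
}
```

The comment's point is that in Spark the same result is reachable by composing `collect_list` with `to_json`, which is why a dedicated aggregate was declined.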
[jira] [Issue Comment Deleted] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31397: --- Comment: was deleted (was: I am working on it.) > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2
[jira] [Commented] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079267#comment-17079267 ] Rakesh Raushan commented on SPARK-31396: We can achieve the same functionality using to/from_json. So we do not need this new function. > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs obtained by aggregating the > key-value pairs of a set of objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite
[jira] [Issue Comment Deleted] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31396: --- Comment: was deleted (was: I am working on it.) > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs obtained by aggregating the > key-value pairs of a set of objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite
[jira] [Assigned] (SPARK-31291) SQLQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31291: --- Assignee: jiaan.geng > SQLQueryTestSuite: Sharing test data and test tables among multiple test > cases. > --- > > Key: SPARK-31291 > URL: https://issues.apache.org/jira/browse/SPARK-31291 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Minor > > SQLQueryTestSuite spends 35 minutes to run. > I checked the code and found that SQLQueryTestSuite loads test data repeatedly.
[jira] [Resolved] (SPARK-31291) SQLQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31291. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28060 [https://github.com/apache/spark/pull/28060] > SQLQueryTestSuite: Sharing test data and test tables among multiple test > cases. > --- > > Key: SPARK-31291 > URL: https://issues.apache.org/jira/browse/SPARK-31291 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Minor > Fix For: 3.0.0 > > > SQLQueryTestSuite spends 35 minutes to run. > I checked the code and found that SQLQueryTestSuite loads test data repeatedly.
[jira] [Resolved] (SPARK-31395) [SPARK][Core]preferred location causing single node be a hot spot
[ https://issues.apache.org/jira/browse/SPARK-31395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31395. -- Resolution: Not A Problem > [SPARK][Core]preferred location causing single node be a hot spot > - > > Key: SPARK-31395 > URL: https://issues.apache.org/jira/browse/SPARK-31395 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.4, 2.4.5 >Reporter: StephenZou >Priority: Minor > > My job runs as follows: > # build the model; it is saved in HDFS as prob1, prob2, ... probN > # then load it into an RDD via a certain ProbInputformat > # do some calculation > The driver node that builds the model is scheduled more frequently than > other nodes, because the first HDFS block replica is written to the local node. This > scheduling hot spot is unnecessary and should be flattened.
[jira] [Commented] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079226#comment-17079226 ] Rakesh Raushan commented on SPARK-31397: I am working on it. > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2
[jira] [Created] (SPARK-31397) Support json_arrayAgg
Rakesh Raushan created SPARK-31397: -- Summary: Support json_arrayAgg Key: SPARK-31397 URL: https://issues.apache.org/jira/browse/SPARK-31397 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Returns a JSON array by aggregating all the JSON arrays from a set of JSON arrays, or by aggregating the values of a Column. Some of the Databases supporting this aggregate function are: * MySQL * PostgreSQL * Maria_DB * Sqlite * IBM Db2
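Spark does not have this function yet, so as a rough illustration of the intended semantics only (function names are hypothetical, loosely modeled on SQLite's json_group_array), here is a plain-Python sketch of the two aggregation modes the description mentions:

```python
import json

def json_array_agg(values):
    # Aggregate the values of a column into one JSON array string.
    return json.dumps(list(values))

def json_array_concat_agg(json_arrays):
    # Aggregate a set of JSON arrays: parse each input array and
    # concatenate their elements into a single JSON array.
    return json.dumps([elem for arr in json_arrays for elem in json.loads(arr)])

print(json_array_agg([1, 2, 3]))                 # [1, 2, 3]
print(json_array_concat_agg(['[1, 2]', '[3]']))  # [1, 2, 3]
```

How an actual Spark implementation would handle nulls and nested types is exactly the kind of behaviour the DBMSs listed above differ on.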
[jira] [Commented] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079198#comment-17079198 ] Rakesh Raushan commented on SPARK-31396: I am working on it. > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs by aggregating the > key-value pairs of a set of objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite
[jira] [Created] (SPARK-31396) Support json_objectAgg function
Rakesh Raushan created SPARK-31396: -- Summary: Support json_objectAgg function Key: SPARK-31396 URL: https://issues.apache.org/jira/browse/SPARK-31396 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Returns a JSON object containing the key-value pairs by aggregating the key-value pairs of a set of objects or columns. This aggregate function is supported by: * MySQL * PostgreSQL * IBM Db2 * Maria_DB * Sqlite
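Again as an illustration only (no such function exists in Spark yet; the name and the duplicate-key policy are assumptions), the intended semantics can be sketched in plain Python:

```python
import json

def json_object_agg(pairs):
    # Aggregate (key, value) pairs into one JSON object string; a later
    # duplicate key overwrites an earlier one, which is one common DBMS
    # behaviour (others raise an error or keep all duplicates).
    return json.dumps(dict(pairs))

print(json_object_agg([("a", 1), ("b", 2)]))  # {"a": 1, "b": 2}
```

Duplicate-key handling is a real design decision here: the DBMSs listed above do not all agree on it.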
[jira] [Created] (SPARK-31395) [SPARK][Core]preferred location causing single node be a hot spot
StephenZou created SPARK-31395: -- Summary: [SPARK][Core]preferred location causing single node be a hot spot Key: SPARK-31395 URL: https://issues.apache.org/jira/browse/SPARK-31395 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.5, 2.3.4 Reporter: StephenZou My job runs as follows: # build the model; the model is saved in HDFS as prob1, prob2, ... probN # then load it into an RDD from a certain ProbInputformat # do some calculation The driver node that builds the model is scheduled more frequently than other nodes, because the HDFS block is written to it first. This scheduling hot spot is unnecessary and should be flattened.
[jira] [Updated] (SPARK-31106) Support is_json function
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31106: --- Description: This function will allow users to verify whether the given string is valid JSON or not. It returns `true` for valid JSON and `false` for invalid JSON. `NULL` is returned for `NULL` input. DBMSs supporting this function are: * MySQL * SQL Server * Sqlite * MariaDB * Amazon Redshift * IBM Db2 was: Currently, null is returned when we come across invalid json. We should either throw an exception for invalid json or false should be returned, like in other DBMSs. Like in `json_array_length` function we need to return NULL for null array. So this might confuse users. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB * Amazon Redshift > Support is_json function > > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will allow users to verify whether the given string is valid > JSON or not. It returns `true` for valid JSON and `false` for invalid JSON. > `NULL` is returned for `NULL` input. > DBMSs supporting this function are: > * MySQL > * SQL Server > * Sqlite > * MariaDB > * Amazon Redshift > * IBM Db2
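The contract in the description (true for valid JSON, false for invalid JSON, NULL in / NULL out) is small enough to sketch directly. A plain-Python sketch of the semantics, not Spark's implementation:

```python
import json

def is_json(s):
    # NULL input returns NULL, per the proposed contract.
    if s is None:
        return None
    try:
        json.loads(s)
        return True
    except (TypeError, ValueError):
        return False

print(is_json('{"a": 1}'))  # True
print(is_json('{a: 1}'))    # False
print(is_json(None))        # None
```

Note that a real implementation still has to pick a JSON dialect: for example, whether a bare scalar like `123` counts as valid JSON differs between DBMSs.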
[jira] [Updated] (SPARK-31106) Support is_json function
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31106: --- Summary: Support is_json function (was: Support IS_JSON) > Support is_json function > > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Currently, null is returned when we come across invalid json. We should > either throw an exception for invalid json or false should be returned, like > in other DBMSs. Like in `json_array_length` function we need to return NULL > for null array. So this might confuse users. > > DBMSs supporting this functions are : > * MySQL > * SQL Server > * Sqlite > * MariaDB > * Amazon Redshift -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python
[ https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079165#comment-17079165 ] chiranjeevi commented on SPARK-28902: - Hi team, I am having the same issue. May I know how this can be fixed? > Spark ML Pipeline with nested Pipelines fails to load when saved from Python > > > Key: SPARK-28902 > URL: https://issues.apache.org/jira/browse/SPARK-28902 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Saif Addin >Priority: Minor > > Hi, this error is affecting a bunch of our nested use cases. > Saving a *PipelineModel* with one of its stages being another > *PipelineModel*, fails when loading it from Scala if it is saved in Python. > *Python side:* > > {code:java} > from pyspark.ml import Pipeline > from pyspark.ml.feature import Tokenizer > t = Tokenizer() > p = Pipeline().setStages([t]) > d = spark.createDataFrame([["Hello Peter Parker"]]) > pm = p.fit(d) > np = Pipeline().setStages([pm]) > npm = np.fit(d) > npm.write().save('./npm_test') > {code} > > > *Scala side:* > > {code:java} > scala> import org.apache.spark.ml.PipelineModel > scala> val pp = PipelineModel.load("./npm_test") > java.lang.IllegalArgumentException: requirement failed: Error loading > metadata: Expected class name org.apache.spark.ml.PipelineModel but found > class name pyspark.ml.pipeline.PipelineModel > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638) > at > org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616) > at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342) > at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380) > at 
org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332) > ... 50 elided > {code} >
[jira] [Updated] (SPARK-31393) Show the correct alias in a more elegant way
[ https://issues.apache.org/jira/browse/SPARK-31393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-31393: --- Description: Some Spark SQL functions implement their alias in an inelegant way. For example: BitwiseCount overrides the sql method {code:java} override def sql: String = s"bit_count(${child.sql})" {code} I don't think it's elegant enough, because `Expression` gives the following definition. {code:java} def sql: String = { val childrenSQL = children.map(_.sql).mkString(", ") s"$prettyName($childrenSQL)" } {code} By this definition, BitwiseCount should override the `prettyName` method instead. was: Spark SQL exists some function no elegant implementation alias. For example: BitwiseCount override the sql method override def sql: String = s"bit_count(${child.sql})" I don't think it's elegant enough. Because `Expression` gives the following definitions. ``` def sql: String = { val childrenSQL = children.map(_.sql).mkString(", ") s"$prettyName($childrenSQL)" } ``` By this definition, BitwiseCount should override `prettyName` method. > Show the correct alias in a more elegant way > > > Key: SPARK-31393 > URL: https://issues.apache.org/jira/browse/SPARK-31393 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Some Spark SQL functions implement their alias in an inelegant way. > For example: BitwiseCount overrides the sql method > {code:java} > override def sql: String = s"bit_count(${child.sql})" > {code} > I don't think it's elegant enough, > because `Expression` gives the following definition. > {code:java} > def sql: String = { > val childrenSQL = children.map(_.sql).mkString(", ") > s"$prettyName($childrenSQL)" > } > {code} > By this definition, BitwiseCount should override the `prettyName` method instead. 
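The design being suggested — a base sql method built from prettyName, with aliases overriding only the name — can be sketched as a plain-Python analogue. The class names mirror Catalyst's, but this is an illustration of the pattern, not Spark's actual code:

```python
class Expression:
    """Analogue of Catalyst's Expression: sql() is derived from
    pretty_name, so an alias only overrides the name, not sql()."""
    def __init__(self, *children):
        self.children = children

    @property
    def pretty_name(self):
        # Default display name derived from the class name.
        return type(self).__name__.lower()

    def sql(self):
        args = ", ".join(c.sql() for c in self.children)
        return f"{self.pretty_name}({args})"

class Literal(Expression):
    def __init__(self, value):
        super().__init__()
        self.value = value

    def sql(self):
        return str(self.value)

class BitwiseCount(Expression):
    # Override only the display name instead of the whole sql() method.
    pretty_name = "bit_count"

print(BitwiseCount(Literal(7)).sql())  # bit_count(7)
```

Every subclass that overrides only the name automatically stays consistent with how arguments are rendered, which is the point of the proposed cleanup.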
[jira] [Updated] (SPARK-31106) Support IS_JSON
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31106: --- Description: Currently, null is returned when we come across invalid json. We should either throw an exception for invalid json or false should be returned, like in other DBMSs. Like in `json_array_length` function we need to return NULL for null array. So this might confuse users. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB * Amazon Redshift was: Currently, null is returned when we come across invalid json. We should either throw an exception for invalid json or false should be returned, like in other DBMSs. Like in `json_array_length` function we need to return NULL for null array. So this might confuse users. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB > Support IS_JSON > --- > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Currently, null is returned when we come across invalid json. We should > either throw an exception for invalid json or false should be returned, like > in other DBMSs. Like in `json_array_length` function we need to return NULL > for null array. So this might confuse users. > > DBMSs supporting this functions are : > * MySQL > * SQL Server > * Sqlite > * MariaDB > * Amazon Redshift
[jira] [Updated] (SPARK-31291) SQLQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-31291: --- Summary: SQLQueryTestSuite: Sharing test data and test tables among multiple test cases. (was: SQLQueryTestSuite: Avoid load test data if test case not uses them.) > SQLQueryTestSuite: Sharing test data and test tables among multiple test > cases. > --- > > Key: SPARK-31291 > URL: https://issues.apache.org/jira/browse/SPARK-31291 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Minor > > SQLQueryTestSuite takes 35 minutes to run. > I checked the code and found that SQLQueryTestSuite loads test data repeatedly.
[jira] [Created] (SPARK-31394) Support for Kubernetes NFS volume mounts
Seongjin Cho created SPARK-31394: Summary: Support for Kubernetes NFS volume mounts Key: SPARK-31394 URL: https://issues.apache.org/jira/browse/SPARK-31394 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3 Reporter: Seongjin Cho Kubernetes supports various kinds of volumes, but Spark for Kubernetes supports only EmptyDir/HostDir/PVC. By adding support for NFS, we can use Spark for Kubernetes with NFS storage. NFS could be used via PVC when we want some clean new empty disk space, but in order to use files in existing NFS shares, we need NFS volume mounts. PR link: [https://github.com/apache/spark/pull/27364]
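If the linked PR is merged, NFS shares would presumably be mounted through the same volume-conf pattern Spark already uses for hostPath/emptyDir/PVC volumes. A hypothetical sketch (the volume name `myshare`, the server, and both paths are made up, and the exact option keys depend on the final implementation):

```shell
# Hypothetical keys following the spark.kubernetes.*.volumes.* pattern:
# mount an existing NFS share into both the driver and the executors.
spark-submit \
  --master k8s://https://k8s-apiserver:6443 \
  --conf spark.kubernetes.driver.volumes.nfs.myshare.mount.path=/data \
  --conf spark.kubernetes.driver.volumes.nfs.myshare.options.server=nfs.example.com \
  --conf spark.kubernetes.driver.volumes.nfs.myshare.options.path=/export/data \
  --conf spark.kubernetes.executor.volumes.nfs.myshare.mount.path=/data \
  --conf spark.kubernetes.executor.volumes.nfs.myshare.options.server=nfs.example.com \
  --conf spark.kubernetes.executor.volumes.nfs.myshare.options.path=/export/data \
  myapp.py
```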
[jira] [Created] (SPARK-31393) Show the correct alias in a more elegant way
jiaan.geng created SPARK-31393: -- Summary: Show the correct alias in a more elegant way Key: SPARK-31393 URL: https://issues.apache.org/jira/browse/SPARK-31393 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng Some Spark SQL functions implement their alias in an inelegant way. For example: BitwiseCount overrides the sql method override def sql: String = s"bit_count(${child.sql})" I don't think it's elegant enough, because `Expression` gives the following definition. ``` def sql: String = { val childrenSQL = children.map(_.sql).mkString(", ") s"$prettyName($childrenSQL)" } ``` By this definition, BitwiseCount should override the `prettyName` method instead.
[jira] [Created] (SPARK-31392) Support CalendarInterval to be reflect to CalendarIntervalType
Kent Yao created SPARK-31392: Summary: Support CalendarInterval to be reflect to CalendarIntervalType Key: SPARK-31392 URL: https://issues.apache.org/jira/browse/SPARK-31392 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Kent Yao Since Spark 3.0.0, CalendarInterval is public; it would be better for it to be inferred as CalendarIntervalType.
[jira] [Created] (SPARK-31391) Add AdaptiveTestUtils to ease the test of AQE
wuyi created SPARK-31391: Summary: Add AdaptiveTestUtils to ease the test of AQE Key: SPARK-31391 URL: https://issues.apache.org/jira/browse/SPARK-31391 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Tests related to AQE currently contain a lot of duplicated code; some utility functions would make the tests simpler.
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078958#comment-17078958 ] zhengruifeng commented on SPARK-31301: -- [~srowen] There are two methods now: {code:java} @Since("2.2.0") def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame = { val spark = dataset.sparkSession import spark.implicits._ SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT) SchemaUtils.checkNumericType(dataset.schema, labelCol) val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)] .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) } val testResults = OldStatistics.chiSqTest(rdd) val pValues = Vectors.dense(testResults.map(_.pValue)) val degreesOfFreedom = testResults.map(_.degreesOfFreedom) val statistics = Vectors.dense(testResults.map(_.statistic)) spark.createDataFrame(Seq(ChiSquareResult(pValues, degreesOfFreedom, statistics))) } @Since("3.1.0") def testChiSquare( dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult] = { SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT) SchemaUtils.checkNumericType(dataset.schema, labelCol) val input = dataset.select(col(labelCol).cast(DoubleType), col(featuresCol)).rdd .map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val chiTestResult = OldStatistics.chiSqTest(input) chiTestResult.map(r => new ChiSqTestResult(r.pValue, r.degreesOfFreedom, r.statistic)) } {code} The newly added one is targeted to 3.1.0, so we can modify its return type without breaking api > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import 
org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> >val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 1 more field]scala> chi.show > +++--+ > | pValues|degreesOfFreedom|statistics| > +++--+ > |[0.68728927879097...| [2, 3]|[0.75,1.5]| > +++--+{code} > > Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, > {{Correlation}} all return a df containing only one row. > I think this is quite hard to use: suppose we have a dataset with dim=1000, > the only thing we can do with the test result is to collect it by > {{head()}} or {{first()}}, and then use it in the driver. > While what I really want to do is filter the df like {{pValue > 0.1}} or > {{corr < 0.5}}. *So I suggest flattening the output df in those tests.* > > note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but > ChiSquareTest and Correlation have been here for a long time.
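The proposed flattening — one row per feature instead of a single row of vectors — can be sketched in plain Python (the numbers and column names are illustrative, not Spark's actual output schema):

```python
# One-row-of-vectors result, as ChiSquareTest.test returns it today
# (illustrative values for a 2-feature dataset).
p_values = [0.68, 0.04]
degrees_of_freedom = [2, 3]
statistics = [0.75, 1.5]

# Flattened form: one row per feature.
rows = [
    {"featureIndex": i, "pValue": p, "degreesOfFreedom": d, "statistic": s}
    for i, (p, d, s) in enumerate(zip(p_values, degrees_of_freedom, statistics))
]

# Now "pValue > 0.1" is an ordinary row filter instead of a collect()
# followed by driver-side vector indexing.
selected = [r["featureIndex"] for r in rows if r["pValue"] > 0.1]
print(selected)  # [0]
```

With a flattened DataFrame the same filter stays distributed, which is exactly what matters for the dim=1000 case mentioned above.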