[jira] [Commented] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080265#comment-17080265 ] Reynold Xin commented on SPARK-31399: - This is bad... [~sowen] and [~joshrosen] did you look into this in the past? > closure cleaner is broken in Spark 3.0 > -- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > The `ClosureCleaner` only supports Scala functions, and it uses the following > check to catch closures: > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This no longer works in 3.0, since we upgraded to Scala 2.12 and most Scala > functions became Java lambdas. > As an example, the following code works well in the Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But it fails in 3.0: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
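The `$Lambda$2438/170049100` frame in the stack trace above shows exactly why the heuristic misses: Scala 2.12 closures no longer carry `$anonfun$` in their class names. Below is a minimal Python sketch of the check's failure mode (the real check is Scala code in ClosureCleaner.scala; the class names used here are taken from this report, not invented):

```python
# Sketch (Python analogy, not Spark code) of the name-based closure check
# quoted above, and why it breaks. Scala 2.11 emitted "$anonfun$" classes;
# Scala 2.12 emits invokedynamic Java lambdas with JVM-generated
# "$Lambda$..." names, as seen in the stack trace.
def is_closure(cls_name: str) -> bool:
    # Spark 2.x heuristic: a class is a Scala closure iff its name
    # contains "$anonfun$".
    return "$anonfun$" in cls_name

# Scala 2.11 compiles { _ => Foo("") } to a class named like this -> detected:
assert is_closure("MyJob$$anonfun$1")
# Scala 2.12 compiles the same closure to a Java lambda named like this ->
# missed, so the closure is never cleaned and it drags the captured,
# non-serializable Column along at serialization time:
assert not is_closure("$Lambda$2438/170049100")
```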
[jira] [Updated] (SPARK-31402) Incorrect rebasing of BCE dates
[ https://issues.apache.org/jira/browse/SPARK-31402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31402: Parent: SPARK-31404 Issue Type: Sub-task (was: Bug) > Incorrect rebasing of BCE dates > --- > > Key: SPARK-31402 > URL: https://issues.apache.org/jira/browse/SPARK-31402 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Dates before the common era are rebased incorrectly, see > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120679/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/ > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > postgreSQL/date.sql > Expected "[-0044]-03-15", but got "[0045]-03-15" Result did not match for > query #93 > select make_date(-44, 3, 15) > {code} > Even though such dates are out of the valid range of dates supported by the DATE > type, there is a test in postgreSQL/date.sql for a negative year, and it > would be nice to fix the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
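The expected `-0044` versus the actual `0045` in the failure above is the classic off-by-one between astronomical year numbering (which has a year 0, so astronomical year -44 is 45 BCE) and era-based year counts. A plain-Python sketch of that relationship (not Spark code; the digits only suggest this is the cause of the bug):

```python
def astro_to_era(year):
    # ISO 8601 / astronomical numbering has a year 0 (= 1 BCE),
    # so astronomical year -44 is 45 BCE, not 44 BCE.
    if year > 0:
        return (year, "CE")
    return (1 - year, "BCE")

# The buggy output "0045-03-15" looks like the era year (45 BCE) printed
# without its sign/era marker, instead of the astronomical year -0044.
assert astro_to_era(-44) == (45, "BCE")
assert astro_to_era(0) == (1, "BCE")
assert astro_to_era(2020) == (2020, "CE")
```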
[jira] [Comment Edited] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
[ https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080257#comment-17080257 ] JinxinTang edited comment on SPARK-31386 at 4/10/20, 5:32 AM: -- Or could you please try Spark 2.4.5? No problem could be reproduced there either. [download link|https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz] was (Author: jinxintang): or could you please try the spark 2.4.5, also no problem could be reproduces [download link|https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz] > Reading broadcast in UDF raises MemoryError when > spark.executor.pyspark.memory is set > - > > Key: SPARK-31386 > URL: https://issues.apache.org/jira/browse/SPARK-31386 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 > Environment: Spark 2.4.4 or AWS EMR > `pyspark --conf spark.executor.pyspark.memory=500m` >Reporter: Viacheslav Krot >Priority: Major > Attachments: 选区_267.png > > > The following code with a udf causes a MemoryError when > `spark.executor.pyspark.memory` is set: > ``` > from pyspark.sql.types import BooleanType > from pyspark.sql.functions import udf > df = spark.createDataFrame([ > ('Alice', 10), > ('Bob', 12) > ], ['name', 'cnt']) > broadcast = spark.sparkContext.broadcast([1,2,3]) > @udf(BooleanType()) > def f(cnt): > return cnt < len(broadcast.value) > df.filter(f(df.cnt)).count() > ``` > The same code works well when spark.executor.pyspark.memory is not set. > The code by itself does not make any sense; it is just the simplest code that reproduces > the bug. 
> > Error: > ``` > 20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, > ip-172-31-32-201.us-east-2.compute.internal, executor 2): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last):20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID > 6, ip-172-31-32-201.us-east-2.compute.internal, executor 2): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", > line 377, in main process() File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", > line 372, in process serializer.dump_stream(func(split_index, iterator), > outfile) File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", > line 345, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", > line 141, in dump_stream for obj in iterator: File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", > line 334, in _batched for item in iterator: File "", line 1, in > File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", > line 85, in return lambda *a: f(*a) File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/util.py", > line 113, in wrapper return f(*args, **kwargs) File "", line 3, > in f File > 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", > line 148, in value self._value = self.load_from_path(self._path) File > "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", > line 124, in load_from_path with open(path, 'rb', 1 << 20) as > f:MemoryError > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at >
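For context on the mechanism (a sketch, not a Spark excerpt): when spark.executor.pyspark.memory is set, the PySpark worker caps its own address space with resource.setrlimit(resource.RLIMIT_AS, ...). Any sizeable allocation made afterwards, such as the 1 MiB read buffer broadcast.py requests via open(path, 'rb', 1 << 20) or the unpickled broadcast value itself, can then fail with MemoryError. A minimal Unix-only reproduction of that failure mode outside Spark, with illustrative limit values:

```python
# Reproduce the MemoryError mechanism outside Spark: cap the address space
# in a child process (as the PySpark worker does when
# spark.executor.pyspark.memory is set), then attempt a large allocation.
# The 512 MiB / 1 GiB values below are illustrative, not Spark defaults.
import subprocess
import sys
import textwrap

child = textwrap.dedent("""
    import resource
    limit = 512 * 1024 * 1024                    # cap address space at 512 MiB
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    try:
        buf = bytearray(1024 * 1024 * 1024)      # 1 GiB, above the cap
        print("allocated")
    except MemoryError:
        print("MemoryError")
""")

result = subprocess.run([sys.executable, "-c", child],
                        capture_output=True, text=True)
print(result.stdout.strip())
```

The allocation is attempted in a subprocess so the demo cannot destabilize the parent interpreter; the same RLIMIT_AS check is what turns the worker's buffered `open` call into a MemoryError.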
[jira] [Commented] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
[ https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080257#comment-17080257 ] JinxinTang commented on SPARK-31386: Or could you please try Spark 2.4.5? No problem could be reproduced there either. [download link|https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz] > Reading broadcast in UDF raises MemoryError when > spark.executor.pyspark.memory is set > -
[jira] [Assigned] (SPARK-24963) Integration tests will fail if they run in a namespace not being the default
[ https://issues.apache.org/jira/browse/SPARK-24963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-24963: - Assignee: Matthew Cheah > Integration tests will fail if they run in a namespace not being the default > > > Key: SPARK-24963 > URL: https://issues.apache.org/jira/browse/SPARK-24963 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Matthew Cheah >Priority: Minor > Fix For: 2.4.0 > > > Related discussion is here: > [https://github.com/apache/spark/pull/21748#pullrequestreview-141048893] > If spark-rbac.yaml is used when tests are run locally, client-mode tests > will fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31401) Add JDK11 example in `bin/docker-image-tool.sh` usage
[ https://issues.apache.org/jira/browse/SPARK-31401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31401. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28171 [https://github.com/apache/spark/pull/28171] > Add JDK11 example in `bin/docker-image-tool.sh` usage > - > > Key: SPARK-31401 > URL: https://issues.apache.org/jira/browse/SPARK-31401 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > {code:java} > $ bin/docker-image-tool.sh -h > ... > - Build and push JDK11-based image with tag "v3.0.0" to docker.io/myrepo > bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 -b > java_image_tag=11-jre-slim build > bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 push {code} > > {code:java} > $ docker run -it --rm docker.io/myrepo/spark:v3.0.0 java --version | tail -n3 > openjdk 11.0.6 2020-01-14 > OpenJDK Runtime Environment 18.9 (build 11.0.6+10) > OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31404) backward compatibility issues after switching to Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31404: Summary: backward compatibility issues after switching to Proleptic Gregorian calendar (was: backward compatibility after switching to Proleptic Gregorian calendar) > backward compatibility issues after switching to Proleptic Gregorian calendar > - > > Key: SPARK-31404 > URL: https://issues.apache.org/jira/browse/SPARK-31404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > In Spark 3.0, we switched to the Proleptic Gregorian calendar by using the Java > 8 datetime APIs. This makes Spark follow the ISO and SQL standards, but > introduces some backward compatibility problems: > 1. Spark may read wrong data from data files written by Spark 2.4 > 2. there may be performance regressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
[ https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080205#comment-17080205 ] JinxinTang commented on SPARK-31386: Is it already fixed in version 2.4.4? It seems OK in both local mode and yarn mode. Could you please provide more detailed information if this bug is still alive, or try an official Spark release build and reproduce it in a local environment? > Reading broadcast in UDF raises MemoryError when > spark.executor.pyspark.memory is set > -
[jira] [Updated] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
[ https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JinxinTang updated SPARK-31386: --- Attachment: 选区_267.png > Reading broadcast in UDF raises MemoryError when > spark.executor.pyspark.memory is set > -
[jira] [Commented] (SPARK-31406) ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080203#comment-17080203 ] jiaan.geng commented on SPARK-31406: I'm working on it. > ThriftServerQueryTestSuite: Sharing test data and test tables among multiple > test cases. > > > Key: SPARK-31406 > URL: https://issues.apache.org/jira/browse/SPARK-31406 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Minor > > ThriftServerQueryTestSuite takes 17 minutes to run. > I checked the code and found that ThriftServerQueryTestSuite loads test data > repeatedly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31406) ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases.
jiaan.geng created SPARK-31406: -- Summary: ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases. Key: SPARK-31406 URL: https://issues.apache.org/jira/browse/SPARK-31406 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng ThriftServerQueryTestSuite takes 17 minutes to run. I checked the code and found that ThriftServerQueryTestSuite loads test data repeatedly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31361) Rebase datetime in parquet/avro according to file written Spark version
[ https://issues.apache.org/jira/browse/SPARK-31361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31361: Parent: SPARK-31404 Issue Type: Sub-task (was: Bug) > Rebase datetime in parquet/avro according to file written Spark version > --- > > Key: SPARK-31361 > URL: https://issues.apache.org/jira/browse/SPARK-31361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080193#comment-17080193 ] Wenchen Fan commented on SPARK-30951: - FYI I've created a ticket for the fail-by-default behavior: https://issues.apache.org/jira/browse/SPARK-31405 > Potential data loss for legacy applications after switch to proleptic > Gregorian calendar > > > Key: SPARK-30951 > URL: https://issues.apache.org/jira/browse/SPARK-30951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Assignee: Maxim Gekk >Priority: Blocker > Labels: release-notes > Fix For: 3.0.0 > > > tl;dr: We recently discovered some Spark 2.x sites that have lots of data > containing dates before October 15, 1582. This could be an issue when such > sites try to upgrade to Spark 3.0. > From SPARK-26651: > {quote}"The changes might impact on the results for dates and timestamps > before October 15, 1582 (Gregorian) > {quote} > We recently discovered that some large scale Spark 2.x applications rely on > dates before October 15, 1582. > Two cases came up recently: > * An application that uses a commercial third-party library to encode > sensitive dates. On insert, the library encodes the actual date as some other > date. On select, the library decodes the date back to the original date. The > encoded value could be any date, including one before October 15, 1582 (e.g., > "0602-04-04"). > * An application that uses a specific unlikely date (e.g., "1200-01-01") as > a marker to indicate "unknown date" (in lieu of null) > Both sites ran into problems after another component in their system was > upgraded to use the proleptic Gregorian calendar. Spark applications that > read files created by the upgraded component were interpreting encoded or > marker dates incorrectly, and vice versa. 
Also, their data now had a mix of > calendars (hybrid and proleptic Gregorian) with no metadata to indicate which > file used which calendar. > Both sites had enormous amounts of existing data, so re-encoding the dates > using some other scheme was not a feasible solution. > This is relevant to Spark 3: > Any Spark 2 application that uses such date-encoding schemes may run into > trouble when run on Spark 3. The application may not properly interpret the > dates previously written by Spark 2. Also, once the Spark 3 version of the > application writes data, the tables will have a mix of calendars (hybrid and > proleptic gregorian) with no metadata to indicate which file uses which > calendar. > Similarly, sites might run with mixed Spark versions, resulting in data > written by one version that cannot be interpreted by the other. And as above, > the tables will now have a mix of calendars with no way to detect which file > uses which calendar. > As with the two real-life example cases, these applications may have enormous > amounts of legacy data, so re-encoding the dates using some other scheme may > not be feasible. > We might want to consider a configuration setting to allow the user to > specify the calendar for storing and retrieving date and timestamp values > (not sure how such a flag would affect other date and timestamp-related > functions). I realize the change is far bigger than just adding a > configuration setting. > Here's a quick example of where trouble may happen, using the real-life case > of the marker date. > In Spark 2.4: > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 1 > scala> > {noformat} > In Spark 3.0 (reading from the same legacy file): > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 0 > scala> > {noformat} > By the way, Hive had a similar problem. 
Hive switched from hybrid calendar to > proleptic Gregorian calendar between 2.x and 3.x. After some upgrade > headaches related to dates before 1582, the Hive community made the following > changes: > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > checks a configuration setting to determine which calendar to use. > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > stores the calendar type in the metadata. > * When reading date or timestamp data from ORC, Parquet, and Avro files, > Hive checks the metadata for the calendar type. > * When reading date or timestamp data from ORC, Parquet, and Avro files that > lack calendar metadata, Hive's behavior is determined by a configuration > setting. This allows Hive to read legacy data (note: if the data already > consists of a mix of calendar types with no metadata,
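The read-side rule in the bullet points above can be sketched as a small resolution function: prefer the calendar recorded in the file's metadata, and fall back to a configuration setting for legacy files that carry no such metadata. This is a minimal illustrative sketch; the metadata key "writer.calendar" and the type names below are hypothetical, not Hive's actual ones.

```scala
// Hypothetical sketch of the Hive-style read path for calendar selection.
// The metadata key "writer.calendar" and the names below are illustrative.
sealed trait Calendar
case object Hybrid extends Calendar             // Julian + Gregorian
case object ProlepticGregorian extends Calendar

def calendarFor(fileMetadata: Map[String, String],
                legacyDefault: Calendar): Calendar =
  fileMetadata.get("writer.calendar") match {
    case Some("hybrid")    => Hybrid
    case Some("proleptic") => ProlepticGregorian
    case _                 => legacyDefault  // no metadata: treat as legacy file
  }
```

With this shape, the ambiguous case the report warns about is exactly the `legacyDefault` branch: files written before the metadata convention existed are indistinguishable, so only a configuration setting can decide.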
[jira] [Comment Edited] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080193#comment-17080193 ] Wenchen Fan edited comment on SPARK-30951 at 4/10/20, 3:40 AM: --- FYI I've created a ticket for the fail-by-default behavior: https://issues.apache.org/jira/browse/SPARK-31405 was (Author: cloud_fan): FTYI I've created a ticket for the fail-by-default behavior: https://issues.apache.org/jira/browse/SPARK-31405 > Potential data loss for legacy applications after switch to proleptic > Gregorian calendar > > > Key: SPARK-30951 > URL: https://issues.apache.org/jira/browse/SPARK-30951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Assignee: Maxim Gekk >Priority: Blocker > Labels: release-notes > Fix For: 3.0.0 > > > tl;dr: We recently discovered some Spark 2.x sites that have lots of data > containing dates before October 15, 1582. This could be an issue when such > sites try to upgrade to Spark 3.0. > From SPARK-26651: > {quote}"The changes might impact on the results for dates and timestamps > before October 15, 1582 (Gregorian) > {quote} > We recently discovered that some large scale Spark 2.x applications rely on > dates before October 15, 1582. > Two cases came up recently: > * An application that uses a commercial third-party library to encode > sensitive dates. On insert, the library encodes the actual date as some other > date. On select, the library decodes the date back to the original date. The > encoded value could be any date, including one before October 15, 1582 (e.g., > "0602-04-04"). > * An application that uses a specific unlikely date (e.g., "1200-01-01") as > a marker to indicate "unknown date" (in lieu of null) > Both sites ran into problems after another component in their system was > upgraded to use the proleptic Gregorian calendar. 
Spark applications that > read files created by the upgraded component were interpreting encoded or > marker dates incorrectly, and vice versa. Also, their data now had a mix of > calendars (hybrid and proleptic Gregorian) with no metadata to indicate which > file used which calendar. > Both sites had enormous amounts of existing data, so re-encoding the dates > using some other scheme was not a feasible solution. > This is relevant to Spark 3: > Any Spark 2 application that uses such date-encoding schemes may run into > trouble when run on Spark 3. The application may not properly interpret the > dates previously written by Spark 2. Also, once the Spark 3 version of the > application writes data, the tables will have a mix of calendars (hybrid and > proleptic gregorian) with no metadata to indicate which file uses which > calendar. > Similarly, sites might run with mixed Spark versions, resulting in data > written by one version that cannot be interpreted by the other. And as above, > the tables will now have a mix of calendars with no way to detect which file > uses which calendar. > As with the two real-life example cases, these applications may have enormous > amounts of legacy data, so re-encoding the dates using some other scheme may > not be feasible. > We might want to consider a configuration setting to allow the user to > specify the calendar for storing and retrieving date and timestamp values > (not sure how such a flag would affect other date and timestamp-related > functions). I realize the change is far bigger than just adding a > configuration setting. > Here's a quick example of where trouble may happen, using the real-life case > of the marker date. 
> In Spark 2.4: > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 1 > scala> > {noformat} > In Spark 3.0 (reading from the same legacy file): > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 0 > scala> > {noformat} > By the way, Hive had a similar problem. Hive switched from hybrid calendar to > proleptic Gregorian calendar between 2.x and 3.x. After some upgrade > headaches related to dates before 1582, the Hive community made the following > changes: > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > checks a configuration setting to determine which calendar to use. > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > stores the calendar type in the metadata. > * When reading date or timestamp data from ORC, Parquet, and Avro files, > Hive checks the metadata for the calendar type. > * When reading date or timestamp data from ORC, Parquet, and Avro files that > lack calendar
[jira] [Created] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not
Wenchen Fan created SPARK-31405: --- Summary: fail by default when read/write datetime values and not sure if they need rebase or not Key: SPARK-31405 URL: https://issues.apache.org/jira/browse/SPARK-31405 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan
[jira] [Updated] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31297: Parent Issue: SPARK-31404 (was: SPARK-30951) > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. > For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 
7170 minutes > local date-time = 1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-time = 1981-09-30T13:52:58 diff = 0 minutes > local date-time = 1982-09-30T12:52:58 diff = 60 minutes > local date-time = 1982-09-30T13:52:58 diff = 0 minutes > local date-time = 1983-09-30T12:52:58 diff = 60 minutes > local date-time = 1983-09-30T13:52:58 diff = 0 minutes > local date-time = 1984-09-29T15:52:58 diff = 60 minutes > local date-time = 1984-09-29T16:52:58 diff = 0 minutes > local date-time = 1985-09-28T15:52:58 diff = 60 minutes > local date-time = 1985-09-28T16:52:58 diff = 0 minutes > local date-time = 1986-09-27T15:52:58 diff = 60 minutes > local date-time = 1986-09-27T16:52:58 diff = 0 minutes > local date-time = 1987-09-26T15:52:58 diff = 60 minutes > local date-time = 
1987-09-26T16:52:58 diff = 0 minutes > local date-time = 1988-09-24T15:52:58 diff = 60 minutes > local date-time = 1988-09-24T16:52:58 diff = 0 minutes > local date-time = 1989-09-23T15:52:58 diff = 60 minutes > local date-time = 1989-09-23T16:52:58 diff = 0 minutes > local date-time = 1990-09-29T15:52:58 diff = 60 minutes > local date-time = 1990-09-29T16:52:58 diff = 0 minutes > local date-time = 1991-09-28T16:52:58 diff = 60 minutes > local date-time = 1991-09-28T17:52:58 diff = 0 minutes > local date-time = 1992-09-26T15:52:58 diff = 60 minutes > local date-time = 1992-09-26T16:52:58 diff = 0 minutes > local date-time = 1993-09-25T15:52:58 diff = 60 minutes > local date-time =
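Since the exploration above shows fewer than 100 switch points per zone, the proposed lookup can be a plain binary search over a precomputed table. A minimal sketch, assuming `switches` holds the sorted microsecond instants where the diff changes and `diffs(i)` is the diff in effect from `switches(i)` onward; the table values below are made up for illustration, not real America/Los_Angeles data.

```scala
// Sketch of rebasing via a precomputed switch table (illustrative values).
// switches(i) is the first micros instant at which diffs(i) applies.
object RebaseSketch {
  val switches: Array[Long] = Array(Long.MinValue, -12219292800000000L, -2208988800000000L)
  val diffs: Array[Long]    = Array(-863744000000L, 256000000L, 0L)

  // Binary-search the last switch point <= micros, then apply its diff.
  def rebase(micros: Long): Long = {
    var lo = 0
    var hi = switches.length - 1
    while (lo < hi) {
      val mid = (lo + hi + 1) >>> 1
      if (switches(mid) <= micros) lo = mid else hi = mid - 1
    }
    micros + diffs(lo)
  }
}
```

Each rebase call is then O(log n) over a tiny array instead of a full local date-time round trip, which is where the expected speed-up comes from.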
[jira] [Updated] (SPARK-31359) Speed up timestamps rebasing
[ https://issues.apache.org/jira/browse/SPARK-31359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31359: Parent: SPARK-31404 Issue Type: Sub-task (was: Improvement) > Speed up timestamps rebasing > > > Key: SPARK-31359 > URL: https://issues.apache.org/jira/browse/SPARK-31359 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, rebasing of timestamps is performed via conversions to local > timestamps and back to microseconds. This is a CPU-intensive operation which > can be avoided by converting via pre-calculated tables per time zone. For > example, the below is timestamps when diffs are changed in > America/Los_Angeles time zone for the range 0001-01-01...2100-01-01 > {code} > 0001-01-01T00:00 diff = -2872 minutes > 0100-03-01T00:00 diff = -1432 minutes > 0200-03-01T00:00 diff = 7 minutes > 0300-03-01T00:00 diff = 1447 minutes > 0500-03-01T00:00 diff = 2887 minutes > 0600-03-01T00:00 diff = 4327 minutes > 0700-03-01T00:00 diff = 5767 minutes > 0900-03-01T00:00 diff = 7207 minutes > 1000-03-01T00:00 diff = 8647 minutes > 1100-03-01T00:00 diff = 10087 minutes > 1300-03-01T00:00 diff = 11527 minutes > 1400-03-01T00:00 diff = 12967 minutes > 1500-03-01T00:00 diff = 14407 minutes > 1582-10-15T00:00 diff = 7 minutes > 1883-11-18T12:22:58 diff = 0 minutes > {code} > It seems it is possible to build rebasing maps, and perform rebasing via the > maps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31385) Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing
[ https://issues.apache.org/jira/browse/SPARK-31385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31385: Parent: SPARK-31404 Issue Type: Sub-task (was: Bug) > Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing > - > > Key: SPARK-31385 > URL: https://issues.apache.org/jira/browse/SPARK-31385 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Microseconds rebasing from the hybrid calendar (Julian + Gregorian) to > Proleptic Gregorian calendar is not symmetric to opposite conversion for the > following time zones: > # Asia/Tehran > # Iran > # Africa/Casablanca > # Africa/El_Aaiun > Here is the results from the https://github.com/apache/spark/pull/28119: > Julian -> Gregorian: > {code:json} > , { > "tz" : "Asia/Tehran", > "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, > -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, > -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, > -2208988800, 2547315000, 2547401400 ], > "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, > -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] > }, { > "tz" : "Iran", > "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, > -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, > -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, > -2208988800, 2547315000, 2547401400 ], > "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, > -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] > }, { > "tz" : "Africa/Casablanca", > "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, > -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, > -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, > -2208988800, 2141866800, 2169079200, 
2172106800, 2199924000, 2202951600, > 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, > 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, > 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, > 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, > 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, > 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, > 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, > 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, > 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, > 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, > 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, > 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, > 3332714400, 3335742000, 3362954400, 3366586800, 3393799200, 3396826800, > 3424644000, 3427671600, 3454884000, 3457911600, 3485728800, 3488756400, > 3515968800, 3519601200, 3546813600, 3549841200, 3577658400, 3580686000, > 3607898400, 3611530800, 3638743200, 3641770800, 3669588000, 3672615600, > 3699828000, 3702855600 ], > "diffs" : [ 174620, 88220, 1820, -84580, -170980, -257380, -343780, > -430180, -516580, -602980, -689380, -775780, -862180, 1820, 0, -3600, 0, > -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, > 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, > -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, > 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, > -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, > 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, > -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600 ] > }, { > "tz" : "Africa/El_Aaiun", > "switches" : [ -62135769600, -59006448000, -55850688000, 
-52694928000, > -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, > -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, > -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, > 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, > 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, > 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, > 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, > 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, > 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, > 2781136800,
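The reported asymmetry has a simple mechanical explanation: the diff is looked up by the *input* instant, and the shift itself can cross a switch point, so the reverse lookup resolves to a different diff. A toy illustration with a made-up one-switch table (not Spark's actual rebase functions):

```scala
// Toy illustration of why Julian->Gregorian and Gregorian->Julian rebasing
// need not be inverses. Values are made up: before instant 100 the diff is
// 10 micros, afterwards it is 0.
val switchPoint = 100L
def diffAt(micros: Long): Long = if (micros < switchPoint) 10L else 0L

def toGregorian(julianMicros: Long): Long = julianMicros + diffAt(julianMicros)
def toJulian(gregorianMicros: Long): Long = gregorianMicros - diffAt(gregorianMicros)

// 95 is before the switch, so the forward shift adds 10 and lands at 105,
// which is after the switch; the reverse lookup then subtracts 0, so the
// round trip returns 105 rather than the original 95.
val roundTrip = toJulian(toGregorian(95L))
```

Instants falling inside such "shadow" intervals around a switch point are exactly where the two rebase directions fail to match, which is consistent with the mismatches clustering in zones with unusual historical offsets like Asia/Tehran and Africa/Casablanca.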
[jira] [Created] (SPARK-31404) backward compatibility after switching to Proleptic Gregorian calendar
Wenchen Fan created SPARK-31404: --- Summary: backward compatibility after switching to Proleptic Gregorian calendar Key: SPARK-31404 URL: https://issues.apache.org/jira/browse/SPARK-31404 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but introduces some backward compatibility problems: 1. may read wrong data from the data files written by Spark 2.4 2. may have perf regression
[jira] [Assigned] (SPARK-31359) Speed up timestamps rebasing
[ https://issues.apache.org/jira/browse/SPARK-31359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31359: --- Assignee: Maxim Gekk > Speed up timestamps rebasing > > > Key: SPARK-31359 > URL: https://issues.apache.org/jira/browse/SPARK-31359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Currently, rebasing of timestamps is performed via conversions to local > timestamps and back to microseconds. This is a CPU-intensive operation which > can be avoided by converting via pre-calculated tables per time zone. For > example, the below is timestamps when diffs are changed in > America/Los_Angeles time zone for the range 0001-01-01...2100-01-01 > {code} > 0001-01-01T00:00 diff = -2872 minutes > 0100-03-01T00:00 diff = -1432 minutes > 0200-03-01T00:00 diff = 7 minutes > 0300-03-01T00:00 diff = 1447 minutes > 0500-03-01T00:00 diff = 2887 minutes > 0600-03-01T00:00 diff = 4327 minutes > 0700-03-01T00:00 diff = 5767 minutes > 0900-03-01T00:00 diff = 7207 minutes > 1000-03-01T00:00 diff = 8647 minutes > 1100-03-01T00:00 diff = 10087 minutes > 1300-03-01T00:00 diff = 11527 minutes > 1400-03-01T00:00 diff = 12967 minutes > 1500-03-01T00:00 diff = 14407 minutes > 1582-10-15T00:00 diff = 7 minutes > 1883-11-18T12:22:58 diff = 0 minutes > {code} > It seems it is possible to build rebasing maps, and perform rebasing via the > maps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31359) Speed up timestamps rebasing
[ https://issues.apache.org/jira/browse/SPARK-31359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31359. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28163 [https://github.com/apache/spark/pull/28163] > Speed up timestamps rebasing > > > Key: SPARK-31359 > URL: https://issues.apache.org/jira/browse/SPARK-31359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, rebasing of timestamps is performed via conversions to local > timestamps and back to microseconds. This is a CPU-intensive operation which > can be avoided by converting via pre-calculated tables per time zone. For > example, the below is timestamps when diffs are changed in > America/Los_Angeles time zone for the range 0001-01-01...2100-01-01 > {code} > 0001-01-01T00:00 diff = -2872 minutes > 0100-03-01T00:00 diff = -1432 minutes > 0200-03-01T00:00 diff = 7 minutes > 0300-03-01T00:00 diff = 1447 minutes > 0500-03-01T00:00 diff = 2887 minutes > 0600-03-01T00:00 diff = 4327 minutes > 0700-03-01T00:00 diff = 5767 minutes > 0900-03-01T00:00 diff = 7207 minutes > 1000-03-01T00:00 diff = 8647 minutes > 1100-03-01T00:00 diff = 10087 minutes > 1300-03-01T00:00 diff = 11527 minutes > 1400-03-01T00:00 diff = 12967 minutes > 1500-03-01T00:00 diff = 14407 minutes > 1582-10-15T00:00 diff = 7 minutes > 1883-11-18T12:22:58 diff = 0 minutes > {code} > It seems it is possible to build rebasing maps, and perform rebasing via the > maps. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30808) Thrift server returns wrong timestamps/dates strings before 1582
[ https://issues.apache.org/jira/browse/SPARK-30808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-30808. - Resolution: Cannot Reproduce > Thrift server returns wrong timestamps/dates strings before 1582 > > > Key: SPARK-30808 > URL: https://issues.apache.org/jira/browse/SPARK-30808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Set the environment variable: > {code:bash} > export TZ="America/Los_Angeles" > ./bin/spark-sql -S > {code} > {code:sql} > spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; > spark.sql.session.timeZoneAmerica/Los_Angeles > spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); > 1001-01-01 00:07:02 > {code} > The expected result must be *1001-01-01 00:00:00*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30808) Thrift server returns wrong timestamps/dates strings before 1582
[ https://issues.apache.org/jira/browse/SPARK-30808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080183#comment-17080183 ] Xiao Li commented on SPARK-30808: - The problem does not exist. > Thrift server returns wrong timestamps/dates strings before 1582 > > > Key: SPARK-30808 > URL: https://issues.apache.org/jira/browse/SPARK-30808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Set the environment variable: > {code:bash} > export TZ="America/Los_Angeles" > ./bin/spark-sql -S > {code} > {code:sql} > spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; > spark.sql.session.timeZoneAmerica/Los_Angeles > spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); > 1001-01-01 00:07:02 > {code} > The expected result must be *1001-01-01 00:00:00*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30808) Thrift server returns wrong timestamps/dates strings before 1582
[ https://issues.apache.org/jira/browse/SPARK-30808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-30808: - > Thrift server returns wrong timestamps/dates strings before 1582 > > > Key: SPARK-30808 > URL: https://issues.apache.org/jira/browse/SPARK-30808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Set the environment variable: > {code:bash} > export TZ="America/Los_Angeles" > ./bin/spark-sql -S > {code} > {code:sql} > spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; > spark.sql.session.timeZoneAmerica/Los_Angeles > spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); > 1001-01-01 00:07:02 > {code} > The expected result must be *1001-01-01 00:00:00*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
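The 00:07:02 residue in the repro above is consistent with the zone's pre-standard-time offset: before 1883-11-18, America/Los_Angeles used local mean time, UTC-07:52:58, which differs from UTC-08:00 by exactly 7 minutes 2 seconds. A quick java.time check of that offset (a sketch of the calendar arithmetic involved, not the Spark code path):

```scala
import java.time.{LocalDateTime, ZoneId}

// For dates long before 1883, the tz database gives America/Los_Angeles
// its local-mean-time offset of UTC-07:52:58. The 7m02s gap to UTC-08:00
// matches the spurious 00:07:02 in the DATE_TRUNC result above.
val zone      = ZoneId.of("America/Los_Angeles")
val lmtOffset = zone.getRules.getOffset(LocalDateTime.of(1001, 1, 1, 0, 0))
val gapToStandardSeconds = lmtOffset.getTotalSeconds - (-8 * 3600)
```

This makes the bug a plausible interaction between millennium truncation and pre-1883 zone offsets rather than a rounding error, which may explain why it stopped reproducing once the truncation path was fixed.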
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080175#comment-17080175 ] zhengruifeng commented on SPARK-31301: -- (One small question: the new method takes a Dataset[_] rather than DataFrame; should it be DataFrame too?) I think so, we may need to change it to DataFrame; what is different about what they do then? I think existing two methods provide similar function, since for the first method, what we can do is just collecting it back to the driver, so I think it is similar to the second one. If we add another method (or change the return type of second method), then we can do more operations on it. Another point is that: If we return dataframe containing one row per feature, then we can avoid the bottleneck on the driver, because we no longer need to collect test results of all features back to the driver. I will test it and maybe send a PR to show it. > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> 
>val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 1 more field]scala> chi.show > +++--+ > | pValues|degreesOfFreedom|statistics| > +++--+ > |[0.68728927879097...| [2, 3]|[0.75,1.5]| > +++--+{code} > > Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, > {{Correlation}} all return a df only containing one row. > I think this is quite hard to use; suppose we have a dataset with dim=1000, > the only operation we can perform on the test result is to collect it by > {{head()}} or {{first()}}, and then use it in the driver. > While what I really want to do is filtering the df like {{pValue > 0.1}} or > {{corr < 0.5}}, *so I suggest to flatten the output df in those tests.* > > note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but > ChiSquareTest and Correlation were here for a long time. > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
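The proposed flattening can be sketched without Spark by modeling the result as one record per feature; with that shape, a per-feature predicate becomes an ordinary row filter instead of a vector lookup on a single collected row. The case class and values below are illustrative, not the actual ML API.

```scala
// Hypothetical "one row per feature" layout for stat test results.
case class FeatureTestResult(featureIndex: Int,
                             pValue: Double,
                             degreesOfFreedom: Long,
                             statistic: Double)

// Flatten the per-column vectors of a single-row result into per-feature rows.
def flatten(pValues: Array[Double],
            dof: Array[Long],
            stats: Array[Double]): Seq[FeatureTestResult] =
  pValues.indices.map(i => FeatureTestResult(i, pValues(i), dof(i), stats(i)))

// Values mirror the shape (not the exact digits) of the chi.show output above.
val rows        = flatten(Array(0.687, 0.682), Array(2L, 3L), Array(0.75, 1.5))
val interesting = rows.filter(_.pValue > 0.1)   // per-feature filter, no collect
```

As the comment thread notes, keeping the result as per-feature rows would also let the filter run distributed, avoiding the driver bottleneck of collecting one wide row for high-dimensional data.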
[jira] [Resolved] (SPARK-31355) Document TABLESAMPLE in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31355. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28130 [https://github.com/apache/spark/pull/28130] > Document TABLESAMPLE in SQL Reference > - > > Key: SPARK-31355 > URL: https://issues.apache.org/jira/browse/SPARK-31355 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Document TABLESAMPLE in SQL Reference -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31355) Document TABLESAMPLE in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31355: Assignee: Huaxin Gao > Document TABLESAMPLE in SQL Reference > - > > Key: SPARK-31355 > URL: https://issues.apache.org/jira/browse/SPARK-31355 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Document TABLESAMPLE in SQL Reference -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30819) Add FMRegressor wrapper to SparkR
[ https://issues.apache.org/jira/browse/SPARK-30819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30819. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27571 [https://github.com/apache/spark/pull/27571] > Add FMRegressor wrapper to SparkR > - > > Key: SPARK-30819 > URL: https://issues.apache.org/jira/browse/SPARK-30819 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Spark should provide a wrapper for {{o.a.s.ml.regression. FMRegressor}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30819) Add FMRegressor wrapper to SparkR
[ https://issues.apache.org/jira/browse/SPARK-30819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30819: Assignee: Maciej Szymkiewicz > Add FMRegressor wrapper to SparkR > - > > Key: SPARK-30819 > URL: https://issues.apache.org/jira/browse/SPARK-30819 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > Spark should provide a wrapper for {{o.a.s.ml.regression. FMRegressor}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30808) Thrift server returns wrong timestamps/dates strings before 1582
[ https://issues.apache.org/jira/browse/SPARK-30808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080102#comment-17080102 ] Xiao Li commented on SPARK-30808: - Is it reverted? > Thrift server returns wrong timestamps/dates strings before 1582 > > > Key: SPARK-30808 > URL: https://issues.apache.org/jira/browse/SPARK-30808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Set the environment variable: > {code:bash} > export TZ="America/Los_Angeles" > ./bin/spark-sql -S > {code} > {code:sql} > spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; > spark.sql.session.timeZoneAmerica/Los_Angeles > spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); > 1001-01-01 00:07:02 > {code} > The expected result must be *1001-01-01 00:00:00*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31403) TreeNode asCode function incorrectly handles null literals
Carl Sverre created SPARK-31403: --- Summary: TreeNode asCode function incorrectly handles null literals Key: SPARK-31403 URL: https://issues.apache.org/jira/browse/SPARK-31403 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Carl Sverre In the TreeNode code in Catalyst the asCode function incorrectly handles null literals. When it tries to render a null literal it will match {{null}} using the third case expression and try to call {{null.toString}} which will raise a NullPointerException. I verified this bug exists in Spark 2.4.4 and the same code appears to be in master: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L707] The fix seems trivial - add an explicit case for null. One way to reproduce this is via: {code:java} val plan = spark .sql("select if(isnull(id), null, 2) from testdb_jdbc.users") .queryExecution .optimizedPlan println(plan.asInstanceOf[Project].projectList.head.asCode) {code} However any other way which generates a Literal with the value null will cause the issue. In this case the above SparkSQL will generate the literal: {{Literal(null, IntegerType)}} for the "trueValue" of the if statement. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
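The missing-null-case pattern is easy to see outside Catalyst. Below is a Python analogue, illustrative only: `render_literal` and its output format are invented for this sketch and are not Spark's `asCode`. The point it demonstrates is the proposed fix, that the explicit null branch must come before any `toString`-style rendering of the value.

```python
def render_literal(value, data_type="IntegerType"):
    # The explicit null case must come first: in the Scala original,
    # falling through to a `value.toString`-style call on a null
    # literal is what raises the NullPointerException.
    if value is None:
        return f"Literal(null, {data_type})"
    return f"Literal({value}, {data_type})"
```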
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31399: -- Priority: Blocker (was: Major)
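Stepping back, the cleaner's job in SPARK-31399's failure mode is to notice that a function captured a non-serializable driver-side object (`col`, a `Column`) and surface that clearly. Python closures allow a rough analogue of that capture-inspection idea; this is illustrative only and not Spark's implementation, with a `threading.Lock` standing in for the unpicklable captured object.

```python
import pickle
import threading

def make_task():
    col = threading.Lock()  # stand-in for a non-serializable driver-side object
    def task(x):
        return (x, col)  # `col` is captured from the enclosing scope
    return task

def unserializable_captures(fn):
    """Return the names of captured values that fail to serialize,
    roughly what ensureSerializable reports via its serialization stack."""
    bad = []
    for name, cell in zip(fn.__code__.co_freevars, fn.__closure__ or ()):
        try:
            pickle.dumps(cell.cell_contents)
        except Exception:
            bad.append(name)
    return bad

task = make_task()
bad = unserializable_captures(task)
```

The Scala 2.12 difficulty described in the ticket is that the capture now lives inside a `java.lang.invoke.SerializedLambda` (visible in the stack trace as `capturedArgs`) rather than in fields of a `$anonfun$`-named class, so the old name-based `isClosure` check never fires.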
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31399: -- Target Version/s: 3.0.0
[jira] [Commented] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079992#comment-17079992 ] Dongjoon Hyun commented on SPARK-31399: --- Thank you for pinging me, [~cloud_fan]. cc [~dbtsai] and [~holden]
[jira] [Created] (SPARK-31402) Incorrect rebasing of BCE dates
Maxim Gekk created SPARK-31402: -- Summary: Incorrect rebasing of BCE dates Key: SPARK-31402 URL: https://issues.apache.org/jira/browse/SPARK-31402 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Dates before the common era are rebased incorrectly; see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120679/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/ {code} sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: postgreSQL/date.sql Expected "[-0044]-03-15", but got "[0045]-03-15" Result did not match for query #93 select make_date(-44, 3, 15) {code} Even though such dates are outside the valid range of dates supported by the DATE type, there is a test in postgreSQL/date.sql for a negative year, and it would be nice to fix the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
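For context, the rebasing being tested converts dates between the legacy hybrid Julian calendar and the proleptic Gregorian calendar adopted in 3.0. For dates around 44 BCE the two calendars label the same physical day with nominal dates a couple of days apart, so a correct rebase shifts the day field by a small offset; it never flips the year's sign as in the failure above. A standalone sketch using the standard integer Julian-day-number formulas (the usual Fliegel/Van Flandern-style arithmetic, not Spark code; astronomical year numbering, in which 44 BCE is year -43):

```python
def jdn_julian(year, month, day):
    # Julian day number of a Julian-calendar date (integer arithmetic).
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

def jdn_gregorian(year, month, day):
    # Julian day number of a (proleptic) Gregorian-calendar date.
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045

# The Ides of March, 44 BCE (astronomical year -43).
ides_julian = jdn_julian(-43, 3, 15)
# The same nominal date names physical days two apart in the two
# calendars, so rebasing must shift days, not the year's sign.
shift = jdn_gregorian(-43, 3, 15) - ides_julian
```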
[jira] [Assigned] (SPARK-31401) Add JDK11 example in `bin/docker-image-tool.sh` usage
[ https://issues.apache.org/jira/browse/SPARK-31401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31401: - Assignee: Dongjoon Hyun > Add JDK11 example in `bin/docker-image-tool.sh` usage > - > > Key: SPARK-31401 > URL: https://issues.apache.org/jira/browse/SPARK-31401 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > {code:java} > $ bin/docker-image-tool.sh -h > ... > - Build and push JDK11-based image with tag "v3.0.0" to docker.io/myrepo > bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 -b > java_image_tag=11-jre-slim build > bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 push {code} > > {code:java} > $ docker run -it --rm docker.io/myrepo/spark:v3.0.0 java --version | tail -n3 > openjdk 11.0.6 2020-01-14 > OpenJDK Runtime Environment 18.9 (build 11.0.6+10) > OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31401) Add JDK11 example in `bin/docker-image-tool.sh` usage
Dongjoon Hyun created SPARK-31401: - Summary: Add JDK11 example in `bin/docker-image-tool.sh` usage Key: SPARK-31401 URL: https://issues.apache.org/jira/browse/SPARK-31401 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 3.0.0 Reporter: Dongjoon Hyun {code:java} $ bin/docker-image-tool.sh -h ... - Build and push JDK11-based image with tag "v3.0.0" to docker.io/myrepo bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 -b java_image_tag=11-jre-slim build bin/docker-image-tool.sh -r docker.io/myrepo -t v3.0.0 push {code} {code:java} $ docker run -it --rm docker.io/myrepo/spark:v3.0.0 java --version | tail -n3 openjdk 11.0.6 2020-01-14 OpenJDK Runtime Environment 18.9 (build 11.0.6+10) OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-31377: -- Description: For some combinations of join algorithm and join types there are no unit tests for the "number of output rows" metric. A list of missing unit tests includes the following. * SortMergeJoin: ExistenceJoin * ShuffledHashJoin: OuterJoin, leftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin * BroadcastHashJoin: LeftAnti, ExistenceJoin was: For some combinations of join algorithm and join types there are no unit tests for the "number of output rows" metric. A list of missing unit tests include the following. * SortMergeJoin: ExistenceJoin * ShuffledHashJoin: OuterJoin, ReftOuter, RightOuter, LeftAnti, LeftSemi, ExistenseJoin * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin * BroadcastHashJoin: LeftAnti, ExistenceJoin > Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite > -- > > Key: SPARK-31377 > URL: https://issues.apache.org/jira/browse/SPARK-31377 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Minor > > For some combinations of join algorithm and join types there are no unit > tests for the "number of output rows" metric. > A list of missing unit tests includes the following. > * SortMergeJoin: ExistenceJoin > * ShuffledHashJoin: OuterJoin, leftOuter, RightOuter, LeftAnti, LeftSemi, > ExistenceJoin > * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin > * BroadcastHashJoin: LeftAnti, ExistenceJoin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31400) The catalogString doesn't distinguish Vectors in ml and mllib
Junpei Zhou created SPARK-31400: --- Summary: The catalogString doesn't distinguish Vectors in ml and mllib Key: SPARK-31400 URL: https://issues.apache.org/jira/browse/SPARK-31400 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.4.5 Environment: Ubuntu 16.04 Reporter: Junpei Zhou h2. Bug Description The `catalogString` is not detailed enough to distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. h2. How to reproduce the bug [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an example from the official document (Python code). If I keep all other lines untouched, and only modify the Vectors import line, which means: {code:java} # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors {code} Or you can directly execute the following code snippet: {code:java} from pyspark.ml.feature import MinMaxScaler # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors dataFrame = spark.createDataFrame([ (0, Vectors.dense([1.0, 0.1, -1.0]),), (1, Vectors.dense([2.0, 1.1, 1.0]),), (2, Vectors.dense([3.0, 10.1, 3.0]),) ], ["id", "features"]) scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures") scalerModel = scaler.fit(dataFrame) {code} It will raise an error: {code:java} IllegalArgumentException: 'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' {code} However, the actual struct and the desired struct are exactly the same string, which cannot provide useful information to the programmer. I would suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31400) The catalogString doesn't distinguish Vectors in ml and mllib
[ https://issues.apache.org/jira/browse/SPARK-31400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junpei Zhou updated SPARK-31400: Description: h2. Bug Description The `catalogString` is not detailed enough to distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. h2. How to reproduce the bug [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an example from the official document (Python code). If I keep all other lines untouched, and only modify the Vectors import line, which means: {code:java} # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors {code} Or you can directly execute the following code snippet: {code:java} from pyspark.ml.feature import MinMaxScaler # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors dataFrame = spark.createDataFrame([ (0, Vectors.dense([1.0, 0.1, -1.0]),), (1, Vectors.dense([2.0, 1.1, 1.0]),), (2, Vectors.dense([3.0, 10.1, 3.0]),) ], ["id", "features"]) scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures") scalerModel = scaler.fit(dataFrame) {code} It will raise an error: {code:java} IllegalArgumentException: 'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' {code} However, the actual struct and the desired struct are exactly the same string, which cannot provide useful information to the programmer. I would suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. Thanks! was: h2. Bug Description The `catalogString` is not detailed enough to distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. h2. How to reproduce the bug [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an example from the official document (Python code). If I keep all other lines untouched, and only modify the Vectors import line, which means: {code:java} # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors {code} Or you can directly execute the following code snippet: {code:java} from pyspark.ml.feature import MinMaxScaler # from pyspark.ml.linalg import Vectors from pyspark.mllib.linalg import Vectors dataFrame = spark.createDataFrame([ (0, Vectors.dense([1.0, 0.1, -1.0]),), (1, Vectors.dense([2.0, 1.1, 1.0]),), (2, Vectors.dense([3.0, 10.1, 3.0]),) ], ["id", "features"]) scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures") scalerModel = scaler.fit(dataFrame) {code} It will raise an error: {code:java} IllegalArgumentException: 'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' {code} However, the actual struct and the desired struct are exactly the same string, which cannot provide useful information to the programmer. I would suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. Thanks! > The catalogString doesn't distinguish Vectors in ml and mllib > - > > Key: SPARK-31400 > URL: https://issues.apache.org/jira/browse/SPARK-31400 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.5 > Environment: Ubuntu 16.04 >Reporter: Junpei Zhou >Priority: Major > > h2. Bug Description > The `catalogString` is not detailed enough to distinguish > pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. > h2. How to reproduce the bug > [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an > example from the official document (Python code). If I keep all other lines > untouched, and only modify the Vectors import line, it will raise an error: > {code:java} > IllegalArgumentException: 'requirement failed: Column features must be of > type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> > but was actually > struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' > {code} > However, the actual struct and the desired struct are exactly the same > string, which cannot provide useful information to the programmer. I would > suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. Thanks!
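The core complaint is that two distinct types print identical names, and it can be reproduced without Spark. Below is a hedged Python sketch of the suggested fix; the classes here are stand-ins built with `type()`, not the real pyspark UDTs. Including the module in the printed name is what makes the two `VectorUDT`s distinguishable.

```python
# Two stand-in UDT classes with identical short names, mimicking the
# VectorUDT in pyspark.ml.linalg vs. pyspark.mllib.linalg
# (the modules are simulated here, not imported).
MlVectorUDT = type("VectorUDT", (), {"__module__": "pyspark.ml.linalg"})
MllibVectorUDT = type("VectorUDT", (), {"__module__": "pyspark.mllib.linalg"})

def short_name(cls):
    # What an unqualified error message effectively prints:
    # indistinguishable for the two classes.
    return cls.__name__

def qualified_name(cls):
    # Including the module disambiguates the two types.
    return f"{cls.__module__}.{cls.__name__}"
```

The analogous change in Spark would be for the error path (or `catalogString` itself) to emit a package-qualified type name whenever two candidate types share a short name.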
[jira] [Resolved] (SPARK-31331) Document Spark integration with Hive UDFs/UDAFs/UDTFs
[ https://issues.apache.org/jira/browse/SPARK-31331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31331. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28104 [https://github.com/apache/spark/pull/28104] > Document Spark integration with Hive UDFs/UDAFs/UDTFs > - > > Key: SPARK-31331 > URL: https://issues.apache.org/jira/browse/SPARK-31331 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Document Spark integration with Hive UDFs/UDAFs/UDTFs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31331) Document Spark integration with Hive UDFs/UDAFs/UDTFs
[ https://issues.apache.org/jira/browse/SPARK-31331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31331: Assignee: Huaxin Gao > Document Spark integration with Hive UDFs/UDAFs/UDTFs > - > > Key: SPARK-31331 > URL: https://issues.apache.org/jira/browse/SPARK-31331 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Document Spark integration with Hive UDFs/UDAFs/UDTFs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-31389: -- Description: Many tests in SQLMetricsSuite run only with codegen turned off. Some complex code paths (for example, generated code in "SortMergeJoin metrics") aren't exercised at all. The generated code should be tested as well. *List of tests that run with codegen off* Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics, ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics, SortMergeJoin(left-anti) metrics was: Many tests in SQLMetricsSuite run only with codegen turned off. Some complex code paths (for example, generated code in "SortMergeJoin metrics") aren't exercised at all. The generated code should be tested as well. *List of tests that run with codegen off* Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics, ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics > Ensure all tests in SQLMetricsSuite run with both codegen on and off > > > Key: SPARK-31389 > URL: https://issues.apache.org/jira/browse/SPARK-31389 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Minor > > Many tests in SQLMetricsSuite run only with codegen turned off. Some complex > code paths (for example, generated code in "SortMergeJoin metrics") aren't > exercised at all. The generated code should be tested as well. 
> *List of tests that run with codegen off* > Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, > BroadcastHashJoin metrics, ShuffledHashJoin metrics, > BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, > BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics, > SortMergeJoin(left-anti) metrics
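The fix the ticket asks for is essentially to parameterize each metrics test over the codegen flag. A hedged Java sketch of that shape (`runForBothModes` and the `Consumer` body are illustrative stand-ins for a `SQLMetricsSuite` test body; `spark.sql.codegen.wholeStage` is Spark's whole-stage codegen switch):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class BothCodegenModes {
    // Run the same assertions once per codegen setting instead of only once.
    // runMetricsCheck stands in for the body of a metrics test.
    static void runForBothModes(Consumer<Map<String, String>> runMetricsCheck) {
        for (boolean codegen : new boolean[]{false, true}) {
            Map<String, String> conf = new HashMap<>();
            conf.put("spark.sql.codegen.wholeStage", String.valueOf(codegen));
            runMetricsCheck.accept(conf);
        }
    }

    public static void main(String[] args) {
        List<String> runs = new ArrayList<>();
        runForBothModes(conf -> runs.add(conf.get("spark.sql.codegen.wholeStage")));
        System.out.println(runs); // the check ran under both settings
    }
}
```

In Spark's own test suites this is normally done with the `withSQLConf` helper; the loop above only illustrates the two-pass structure.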
[jira] [Resolved] (SPARK-31021) Support MariaDB Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-31021. Fix Version/s: 3.1.0 Assignee: Gabor Somogyi Resolution: Fixed > Support MariaDB Kerberos login in JDBC connector > > > Key: SPARK-31021 > URL: https://issues.apache.org/jira/browse/SPARK-31021 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > >
[jira] [Commented] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079514#comment-17079514 ] Xiao Li commented on SPARK-31399: - cc [~zsxwing] > closure cleaner is broken in Spark 3.0 > -- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > The `ClosureCleaner` only supports Scala functions, and it uses the following > check to catch closures: > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more, since we upgraded to Scala 2.12 and most Scala > functions became Java lambdas. > As an example, the following code works well in the Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But it fails in 3.0: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code}
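The failure mode of the quoted check can be reproduced with plain JVM reflection, outside Spark. The following Java sketch is illustrative only (the method names `isClosureOld` and `isClosureNew` are made up, and this is not Spark's actual fix): the 2.x-era name test misses LambdaMetafactory-generated classes, which are synthetic and carry "$$Lambda" in their names.

```java
import java.util.function.Function;

public class ClosureCheck {
    // The check quoted above: matches only Scala <= 2.11-style anonymous
    // function classes, whose names contain "$anonfun$".
    static boolean isClosureOld(Class<?> cls) {
        return cls.getName().contains("$anonfun$");
    }

    // Hypothetical broadened check: LambdaMetafactory-generated lambda
    // classes are synthetic and their runtime names contain "$$Lambda",
    // so accept those as closures too.
    static boolean isClosureNew(Class<?> cls) {
        return cls.getName().contains("$anonfun$")
                || (cls.isSynthetic() && cls.getName().contains("$$Lambda"));
    }

    public static void main(String[] args) {
        Function<Integer, Integer> f = x -> x + 1; // a Java lambda, like Scala 2.12 functions
        System.out.println(isClosureOld(f.getClass())); // the old name test misses it
        System.out.println(isClosureNew(f.getClass()));
    }
}
```

Scala 2.12 compiles most function literals the same way `javac` compiles `x -> x + 1`, which is why a name-based check written for 2.11 stops firing.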
[jira] [Commented] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079454#comment-17079454 ] Wenchen Fan commented on SPARK-31399: - cc [~rxin] [~sowen] [~dongjoon] [~tgraves] > closure cleaner is broken in Spark 3.0 > -- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > [...]
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31399: Description: The `ClosureCleaner` only support Scala functions and it uses the following check to catch closures {code} // Check whether a class represents a Scala closure private def isClosure(cls: Class[_]): Boolean = { cls.getName.contains("$anonfun$") } {code} This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala functions become Java lambdas. As an example, the following code works well in Spark 2.4 Spark Shell: {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. import org.apache.spark.sql.functions.lit defined class Foo col: org.apache.spark.sql.Column = 123 df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 {code} But fails in 3.0 {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) at org.apache.spark.rdd.RDD.map(RDD.scala:421) ... 
39 elided Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column Serialization stack: - object not serializable (class: org.apache.spark.sql.Column, value: 123) - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) - object (class $iw, $iw@2d87ac2b) - element of array (index: 0) - array (class [Ljava.lang.Object;, size 1) - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;) - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class $iw, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) - writeReplace data (class: java.lang.invoke.SerializedLambda) - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) ... 47 more {code} was: The `ClosureCleaner` only support Scala functions and it uses the following check to catch closures {code} // Check whether a class represents a Scala closure private def isClosure(cls: Class[_]): Boolean = { cls.getName.contains("$anonfun$") } {code} This doesn't work in 3.0 anyway as we upgrade to Scala 2.12 and most Scala functions become Java lambdas. As an example, the following code works well in Spark 2.4 Spark Shell: {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. 
import org.apache.spark.sql.functions.lit defined class Foo col: org.apache.spark.sql.Column = 123 df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 {code} But fails in 3.0 {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) at
[jira] [Created] (SPARK-31399) closure cleaner is broken in Spark 3.0
Wenchen Fan created SPARK-31399: --- Summary: closure cleaner is broken in Spark 3.0 Key: SPARK-31399 URL: https://issues.apache.org/jira/browse/SPARK-31399 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Wenchen Fan The `ClosureCleaner` only support Scala functions and it uses the following check to catch closures {code} // Check whether a class represents a Scala closure private def isClosure(cls: Class[_]): Boolean = { cls.getName.contains("$anonfun$") } {code} This doesn't work in 3.0 anyway as we upgrade to Scala 2.12 and most Scala functions become Java lambdas. As an example, the following code works well in Spark 2.4 Spark Shell: {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. import org.apache.spark.sql.functions.lit defined class Foo col: org.apache.spark.sql.Column = 123 df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 {code} But fails in 3.0 {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. 
org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) at org.apache.spark.rdd.RDD.map(RDD.scala:421) ... 39 elided Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column Serialization stack: - object not serializable (class: org.apache.spark.sql.Column, value: 123) - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) - object (class $iw, $iw@2d87ac2b) - element of array (index: 0) - array (class [Ljava.lang.Object;, size 1) - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;) - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class $iw, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) - writeReplace data (class: java.lang.invoke.SerializedLambda) - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) ... 
47 more {code}
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079371#comment-17079371 ] Sean R. Owen commented on SPARK-31301: -- (One small question: the new method takes a Dataset[_] rather than DataFrame; should it be DataFrame too?) The new method's purpose seemed to be more about providing the result as objects rather than a DataFrame. The proposal would require making the new method also return a DataFrame, and then we have two very similar methods; what is different about what they do then, for the caller looking at the API? > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> >val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 
1 more field]scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
> Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, {{Correlation}} all return a df containing only one row. > This is quite hard to use: suppose we have a dataset with dim=1000, the only operation we can do with the test result is to collect it via {{head()}} or {{first()}} and then use it in the driver. > While what I really want to do is filter the df, like {{pValue > 0.1}} or {{corr < 0.5}}, *so I suggest to flatten the output df in those tests.* > Note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
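A flattened layout makes the filtering the reporter asks for trivial. The Java sketch below is only an illustration of the data-shape change (all names are made up; the first pValue is taken from the example above, the second is invented): one record per feature instead of a single row of vectors.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FlattenStats {
    // One entry per feature, instead of parallel vectors in a single row.
    static class FeatureStat {
        final int featureIndex;
        final double pValue;
        final double statistic;
        FeatureStat(int i, double p, double s) {
            featureIndex = i; pValue = p; statistic = s;
        }
    }

    // Turn the single row of vectors (modeled as arrays) into a table-like
    // list that can be filtered directly.
    static List<FeatureStat> flatten(double[] pValues, double[] statistics) {
        return IntStream.range(0, pValues.length)
                .mapToObj(i -> new FeatureStat(i, pValues[i], statistics[i]))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        double[] pValues = {0.687, 0.05};   // second value is made up
        double[] stats = {0.75, 1.5};
        // The filtering requested in the issue, e.g. pValue > 0.1:
        long kept = flatten(pValues, stats).stream()
                .filter(s -> s.pValue > 0.1)
                .count();
        System.out.println(kept); // number of features passing the filter
    }
}
```

With the current one-row API the equivalent operation requires collecting the whole vector to the driver first, which is the pain point of the issue.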
[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079337#comment-17079337 ] Everett Rush commented on SPARK-27249: -- Hi Nick, I think the solution could make use of the mapPartitions method, but still there should be a multicolumn transformer class. The unary transformer is very limiting. > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.1.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png > > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF, such as a database connection. > > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row, in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > the output DataType and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation.
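The mapPartitions idea behind the proposed `Iterator[Row] => Iterator[Row]` design can be sketched in plain Java (all names here are hypothetical, and `String` stands in for `Row`): the expensive object is created once per partition iterator, not once per element as in a UDF.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

public class PartitionTransformDemo {
    // Stand-in for something expensive to create, e.g. a database connection.
    static class ExpensiveResource {
        static int constructions = 0;
        ExpensiveResource() { constructions++; }
        String lookup(String key) { return key.toUpperCase(Locale.ROOT); }
    }

    // Whole-partition transform: the resource is initialized once for the
    // whole iterator, then reused for every element. (Eager for brevity;
    // a real implementation would stay lazy.)
    static Iterator<String> transformPartition(Iterator<String> rows) {
        ExpensiveResource res = new ExpensiveResource(); // once per partition
        List<String> out = new ArrayList<>();
        rows.forEachRemaining(r -> out.add(res.lookup(r)));
        return out.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> result = transformPartition(List.of("a", "b", "c").iterator());
        result.forEachRemaining(System.out::println);
        System.out.println("resources created: " + ExpensiveResource.constructions);
    }
}
```

This is the amortization `UnaryTransformer` cannot express, since its transform function sees one value at a time.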
[jira] [Created] (SPARK-31398) Speed up reading dates in ORC
Maxim Gekk created SPARK-31398: -- Summary: Speed up reading dates in ORC Key: SPARK-31398 URL: https://issues.apache.org/jira/browse/SPARK-31398 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the ORC datasource converts values of DATE type to java.sql.Date, and then converts the result to days since the epoch in the Proleptic Gregorian calendar. The ORC datasource performs this conversion when spark.sql.orc.enableVectorizedReader is set to false. The conversion to java.sql.Date is unnecessary because we can use DaysWritable, which performs the rebasing in a much more optimal way.
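The two conversion paths the issue contrasts can be illustrated in plain Java (this is not Spark's DaysWritable, just the arithmetic): `java.time.LocalDate` already counts Proleptic Gregorian days since 1970-01-01, so the intermediate `java.sql.Date` object is avoidable.

```java
import java.sql.Date;
import java.time.LocalDate;

public class EpochDays {
    public static void main(String[] args) {
        // The slow path sketched in the issue: materialize a java.sql.Date,
        // then derive days since the epoch from it.
        Date sqlDate = Date.valueOf("2020-04-09");
        long viaSqlDate = sqlDate.toLocalDate().toEpochDay();

        // The direct path: Proleptic Gregorian days since 1970-01-01,
        // with no java.sql.Date allocation at all.
        long direct = LocalDate.of(2020, 4, 9).toEpochDay();

        System.out.println(viaSqlDate + " " + direct); // both are 18361
    }
}
```

For modern dates the two paths agree; the rebasing DaysWritable handles matters for dates before the Julian/Gregorian switchover, where `java.sql.Date`'s hybrid calendar diverges from the Proleptic Gregorian one.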
[jira] [Resolved] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31397. -- Resolution: Won't Fix > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2
[jira] [Resolved] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31396. -- Resolution: Won't Fix > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs obtained by aggregating the > key-value pairs of a set of objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite
[jira] [Commented] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079268#comment-17079268 ] Rakesh Raushan commented on SPARK-31397: The same functionality can be achieved using to/from_json, and performance will be almost equivalent. So we do not need to implement this new function. > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2
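What a `json_arrayAgg`-style aggregate produces can be shown in a few lines of Java. This is a naive illustration with hypothetical names (no escaping, not a Spark or SQL-standard implementation): collapse a column of values into one JSON array string.

```java
import java.util.List;
import java.util.stream.Collectors;

public class JsonArrayAgg {
    // Aggregate a "column" of string values into a single JSON array.
    static String jsonArrayAgg(List<String> column) {
        return column.stream()
                .map(v -> "\"" + v + "\"")   // naive quoting; real impls escape
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(jsonArrayAgg(List.of("a", "b", "c"))); // ["a","b","c"]
    }
}
```

The comment's point is that in Spark the same result is reachable by composing `collect_list` with `to_json`, which is why a dedicated aggregate was declined.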
[jira] [Issue Comment Deleted] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31397: --- Comment: was deleted (was: I am working on it.) > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2
[jira] [Commented] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079267#comment-17079267 ] Rakesh Raushan commented on SPARK-31396: We can achieve the same functionality using to/from_json. So we do not need this new function. > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs obtained by aggregating the > key-value pairs of a set of objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite
[jira] [Issue Comment Deleted] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31396: --- Comment: was deleted (was: I am working on it.) > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs obtained by aggregating the > key-value pairs of a set of objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite
[jira] [Assigned] (SPARK-31291) SQLQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31291: --- Assignee: jiaan.geng > SQLQueryTestSuite: Sharing test data and test tables among multiple test > cases. > --- > > Key: SPARK-31291 > URL: https://issues.apache.org/jira/browse/SPARK-31291 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Minor > > SQLQueryTestSuite spends 35 minutes to run. > I checked the code and found that SQLQueryTestSuite loads test data repeatedly.
[jira] [Resolved] (SPARK-31291) SQLQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31291. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28060 [https://github.com/apache/spark/pull/28060] > SQLQueryTestSuite: Sharing test data and test tables among multiple test > cases. > --- > > Key: SPARK-31291 > URL: https://issues.apache.org/jira/browse/SPARK-31291 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Minor > Fix For: 3.0.0 > > > SQLQueryTestSuite spends 35 minutes to run. > I checked the code and found that SQLQueryTestSuite loads test data repeatedly.
[jira] [Resolved] (SPARK-31395) [SPARK][Core]preferred location causing single node be a hot spot
[ https://issues.apache.org/jira/browse/SPARK-31395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31395. -- Resolution: Not A Problem > [SPARK][Core]preferred location causing single node be a hot spot > - > > Key: SPARK-31395 > URL: https://issues.apache.org/jira/browse/SPARK-31395 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.4, 2.4.5 >Reporter: StephenZou >Priority: Minor > > My job runs as follows: > # build the model; it is saved in HDFS as prob1, prob2, ... probN > # then load it into an RDD via a certain ProbInputformat > # do some calculation > The driver node that builds the model is scheduled more frequently than > other nodes, because the first HDFS block replica is written to the local node. This > scheduling hot spot is unnecessary and should be flattened.
[jira] [Commented] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079226#comment-17079226 ] Rakesh Raushan commented on SPARK-31397: I am working on it. > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2
[jira] [Created] (SPARK-31397) Support json_arrayAgg
Rakesh Raushan created SPARK-31397: -- Summary: Support json_arrayAgg Key: SPARK-31397 URL: https://issues.apache.org/jira/browse/SPARK-31397 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Returns a JSON array by aggregating all the JSON arrays from a set of JSON arrays, or by aggregating the values of a Column. Some of the Databases supporting this aggregate function are: * MySQL * PostgreSQL * Maria_DB * Sqlite * IBM Db2
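Spark does not have this function yet, so as a rough illustration of the intended semantics only (function names are hypothetical, loosely modeled on SQLite's json_group_array), here is a plain-Python sketch of the two aggregation modes the description mentions:

```python
import json

def json_array_agg(values):
    # Aggregate the values of a column into one JSON array string.
    return json.dumps(list(values))

def json_array_concat_agg(json_arrays):
    # Aggregate a set of JSON arrays: parse each input array and
    # concatenate their elements into a single JSON array.
    return json.dumps([elem for arr in json_arrays for elem in json.loads(arr)])

print(json_array_agg([1, 2, 3]))                 # [1, 2, 3]
print(json_array_concat_agg(['[1, 2]', '[3]']))  # [1, 2, 3]
```

How an actual Spark implementation would handle nulls and nested types is exactly the kind of behaviour the DBMSs listed above differ on.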
[jira] [Commented] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079198#comment-17079198 ] Rakesh Raushan commented on SPARK-31396: I am working on it. > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs by aggregating the > key-value pairs of a set of objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite
[jira] [Created] (SPARK-31396) Support json_objectAgg function
Rakesh Raushan created SPARK-31396: -- Summary: Support json_objectAgg function Key: SPARK-31396 URL: https://issues.apache.org/jira/browse/SPARK-31396 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Returns a JSON object containing the key-value pairs by aggregating the key-value pairs of a set of objects or columns. This aggregate function is supported by: * MySQL * PostgreSQL * IBM Db2 * Maria_DB * Sqlite
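Again as an illustration only (no such function exists in Spark yet; the name and the duplicate-key policy are assumptions), the intended semantics can be sketched in plain Python:

```python
import json

def json_object_agg(pairs):
    # Aggregate (key, value) pairs into one JSON object string; a later
    # duplicate key overwrites an earlier one, which is one common DBMS
    # behaviour (others raise an error or keep all duplicates).
    return json.dumps(dict(pairs))

print(json_object_agg([("a", 1), ("b", 2)]))  # {"a": 1, "b": 2}
```

Duplicate-key handling is a real design decision here: the DBMSs listed above do not all agree on it.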
[jira] [Created] (SPARK-31395) [SPARK][Core]preferred location causing single node be a hot spot
StephenZou created SPARK-31395: -- Summary: [SPARK][Core]preferred location causing single node be a hot spot Key: SPARK-31395 URL: https://issues.apache.org/jira/browse/SPARK-31395 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.5, 2.3.4 Reporter: StephenZou My job runs as follows: # build the model; the model is saved in HDFS as prob1, prob2, ... probN # then load it into an RDD from a certain ProbInputformat # do some calculation The driver node that builds the model is scheduled more frequently than other nodes, because the HDFS block is written to it first. This scheduling hot spot is unnecessary and should be flattened.
[jira] [Updated] (SPARK-31106) Support is_json function
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31106: --- Description: This function will allow users to verify whether the given string is valid JSON or not. It returns `true` for valid JSON and `false` for invalid JSON. `NULL` is returned for `NULL` input. DBMSs supporting this function are: * MySQL * SQL Server * Sqlite * MariaDB * Amazon Redshift * IBM Db2 was: Currently, null is returned when we come across invalid json. We should either throw an exception for invalid json or false should be returned, like in other DBMSs. Like in `json_array_length` function we need to return NULL for null array. So this might confuse users. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB * Amazon Redshift > Support is_json function > > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will allow users to verify whether the given string is valid > JSON or not. It returns `true` for valid JSON and `false` for invalid JSON. > `NULL` is returned for `NULL` input. > DBMSs supporting this function are: > * MySQL > * SQL Server > * Sqlite > * MariaDB > * Amazon Redshift > * IBM Db2
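The contract in the description (true for valid JSON, false for invalid JSON, NULL in / NULL out) is small enough to sketch directly. A plain-Python sketch of the semantics, not Spark's implementation:

```python
import json

def is_json(s):
    # NULL input returns NULL, per the proposed contract.
    if s is None:
        return None
    try:
        json.loads(s)
        return True
    except (TypeError, ValueError):
        return False

print(is_json('{"a": 1}'))  # True
print(is_json('{a: 1}'))    # False
print(is_json(None))        # None
```

Note that a real implementation still has to pick a JSON dialect: for example, whether a bare scalar like `123` counts as valid JSON differs between DBMSs.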
[jira] [Updated] (SPARK-31106) Support is_json function
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31106: --- Summary: Support is_json function (was: Support IS_JSON) > Support is_json function > > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Currently, null is returned when we come across invalid json. We should > either throw an exception for invalid json or false should be returned, like > in other DBMSs. Like in `json_array_length` function we need to return NULL > for null array. So this might confuse users. > > DBMSs supporting this functions are : > * MySQL > * SQL Server > * Sqlite > * MariaDB > * Amazon Redshift -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python
[ https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079165#comment-17079165 ] chiranjeevi commented on SPARK-28902: - Hi team, I am having the same issue. May I know how this can be fixed? > Spark ML Pipeline with nested Pipelines fails to load when saved from Python > > > Key: SPARK-28902 > URL: https://issues.apache.org/jira/browse/SPARK-28902 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Saif Addin >Priority: Minor > > Hi, this error is affecting a bunch of our nested use cases. > Saving a *PipelineModel* with one of its stages being another > *PipelineModel*, fails when loading it from Scala if it is saved in Python. > *Python side:* > > {code:java} > from pyspark.ml import Pipeline > from pyspark.ml.feature import Tokenizer > t = Tokenizer() > p = Pipeline().setStages([t]) > d = spark.createDataFrame([["Hello Peter Parker"]]) > pm = p.fit(d) > np = Pipeline().setStages([pm]) > npm = np.fit(d) > npm.write().save('./npm_test') > {code} > > > *Scala side:* > > {code:java} > scala> import org.apache.spark.ml.PipelineModel > scala> val pp = PipelineModel.load("./npm_test") > java.lang.IllegalArgumentException: requirement failed: Error loading > metadata: Expected class name org.apache.spark.ml.PipelineModel but found > class name pyspark.ml.pipeline.PipelineModel > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638) > at > org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616) > at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342) > at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380) > at 
org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332) > ... 50 elided > {code} >
[jira] [Updated] (SPARK-31393) Show the correct alias in a more elegant way
[ https://issues.apache.org/jira/browse/SPARK-31393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-31393: --- Description: Some Spark SQL functions implement their alias in an inelegant way. For example: BitwiseCount overrides the sql method {code:java} override def sql: String = s"bit_count(${child.sql})" {code} I don't think it's elegant enough, because `Expression` gives the following definition. {code:java} def sql: String = { val childrenSQL = children.map(_.sql).mkString(", ") s"$prettyName($childrenSQL)" } {code} By this definition, BitwiseCount should override the `prettyName` method instead. was: Spark SQL exists some function no elegant implementation alias. For example: BitwiseCount override the sql method override def sql: String = s"bit_count(${child.sql})" I don't think it's elegant enough. Because `Expression` gives the following definitions. ``` def sql: String = { val childrenSQL = children.map(_.sql).mkString(", ") s"$prettyName($childrenSQL)" } ``` By this definition, BitwiseCount should override `prettyName` method. > Show the correct alias in a more elegant way > > > Key: SPARK-31393 > URL: https://issues.apache.org/jira/browse/SPARK-31393 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Some Spark SQL functions implement their alias in an inelegant way. > For example: BitwiseCount overrides the sql method > {code:java} > override def sql: String = s"bit_count(${child.sql})" > {code} > I don't think it's elegant enough, > because `Expression` gives the following definition. > {code:java} > def sql: String = { > val childrenSQL = children.map(_.sql).mkString(", ") > s"$prettyName($childrenSQL)" > } > {code} > By this definition, BitwiseCount should override the `prettyName` method instead. 
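The design being suggested — a base sql method built from prettyName, with aliases overriding only the name — can be sketched as a plain-Python analogue. The class names mirror Catalyst's, but this is an illustration of the pattern, not Spark's actual code:

```python
class Expression:
    """Analogue of Catalyst's Expression: sql() is derived from
    pretty_name, so an alias only overrides the name, not sql()."""
    def __init__(self, *children):
        self.children = children

    @property
    def pretty_name(self):
        # Default display name derived from the class name.
        return type(self).__name__.lower()

    def sql(self):
        args = ", ".join(c.sql() for c in self.children)
        return f"{self.pretty_name}({args})"

class Literal(Expression):
    def __init__(self, value):
        super().__init__()
        self.value = value

    def sql(self):
        return str(self.value)

class BitwiseCount(Expression):
    # Override only the display name instead of the whole sql() method.
    pretty_name = "bit_count"

print(BitwiseCount(Literal(7)).sql())  # bit_count(7)
```

Every subclass that overrides only the name automatically stays consistent with how arguments are rendered, which is the point of the proposed cleanup.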
[jira] [Updated] (SPARK-31106) Support IS_JSON
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31106: --- Description: Currently, null is returned when we come across invalid json. We should either throw an exception for invalid json or false should be returned, like in other DBMSs. Like in `json_array_length` function we need to return NULL for null array. So this might confuse users. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB * Amazon Redshift was: Currently, null is returned when we come across invalid json. We should either throw an exception for invalid json or false should be returned, like in other DBMSs. Like in `json_array_length` function we need to return NULL for null array. So this might confuse users. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB > Support IS_JSON > --- > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Currently, null is returned when we come across invalid json. We should > either throw an exception for invalid json or false should be returned, like > in other DBMSs. Like in `json_array_length` function we need to return NULL > for null array. So this might confuse users. > > DBMSs supporting this functions are : > * MySQL > * SQL Server > * Sqlite > * MariaDB > * Amazon Redshift
[jira] [Updated] (SPARK-31291) SQLQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-31291: --- Summary: SQLQueryTestSuite: Sharing test data and test tables among multiple test cases. (was: SQLQueryTestSuite: Avoid load test data if test case not uses them.) > SQLQueryTestSuite: Sharing test data and test tables among multiple test > cases. > --- > > Key: SPARK-31291 > URL: https://issues.apache.org/jira/browse/SPARK-31291 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Minor > > SQLQueryTestSuite takes 35 minutes to run. > I checked the code and found that SQLQueryTestSuite loads test data repeatedly.
[jira] [Created] (SPARK-31394) Support for Kubernetes NFS volume mounts
Seongjin Cho created SPARK-31394: Summary: Support for Kubernetes NFS volume mounts Key: SPARK-31394 URL: https://issues.apache.org/jira/browse/SPARK-31394 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3 Reporter: Seongjin Cho Kubernetes supports various kinds of volumes, but Spark for Kubernetes supports only EmptyDir/HostDir/PVC. By adding support for NFS, we can use Spark for Kubernetes with NFS storage. NFS could be used via PVC when we want some clean new empty disk space, but in order to use files in existing NFS shares, we need NFS volume mounts. PR link: [https://github.com/apache/spark/pull/27364]
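If the linked PR is merged, NFS shares would presumably be mounted through the same volume-conf pattern Spark already uses for hostPath/emptyDir/PVC volumes. A hypothetical sketch (the volume name `myshare`, the server, and both paths are made up, and the exact option keys depend on the final implementation):

```shell
# Hypothetical keys following the spark.kubernetes.*.volumes.* pattern:
# mount an existing NFS share into both the driver and the executors.
spark-submit \
  --master k8s://https://k8s-apiserver:6443 \
  --conf spark.kubernetes.driver.volumes.nfs.myshare.mount.path=/data \
  --conf spark.kubernetes.driver.volumes.nfs.myshare.options.server=nfs.example.com \
  --conf spark.kubernetes.driver.volumes.nfs.myshare.options.path=/export/data \
  --conf spark.kubernetes.executor.volumes.nfs.myshare.mount.path=/data \
  --conf spark.kubernetes.executor.volumes.nfs.myshare.options.server=nfs.example.com \
  --conf spark.kubernetes.executor.volumes.nfs.myshare.options.path=/export/data \
  myapp.py
```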
[jira] [Created] (SPARK-31393) Show the correct alias in a more elegant way
jiaan.geng created SPARK-31393: -- Summary: Show the correct alias in a more elegant way Key: SPARK-31393 URL: https://issues.apache.org/jira/browse/SPARK-31393 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng Some Spark SQL functions implement their alias in an inelegant way. For example: BitwiseCount overrides the sql method override def sql: String = s"bit_count(${child.sql})" I don't think it's elegant enough, because `Expression` gives the following definition. ``` def sql: String = { val childrenSQL = children.map(_.sql).mkString(", ") s"$prettyName($childrenSQL)" } ``` By this definition, BitwiseCount should override the `prettyName` method instead.
[jira] [Created] (SPARK-31392) Support CalendarInterval to be reflect to CalendarIntervalType
Kent Yao created SPARK-31392: Summary: Support CalendarInterval to be reflect to CalendarIntervalType Key: SPARK-31392 URL: https://issues.apache.org/jira/browse/SPARK-31392 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Kent Yao Since Spark 3.0.0, CalendarInterval is public; it would be better for it to be inferred as CalendarIntervalType.
[jira] [Created] (SPARK-31391) Add AdaptiveTestUtils to ease the test of AQE
wuyi created SPARK-31391: Summary: Add AdaptiveTestUtils to ease the test of AQE Key: SPARK-31391 URL: https://issues.apache.org/jira/browse/SPARK-31391 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Tests related to AQE currently contain a lot of duplicated code; some utility functions would make the tests simpler.
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078958#comment-17078958 ] zhengruifeng commented on SPARK-31301: -- [~srowen] There are two methods now: {code:java} @Since("2.2.0") def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame = { val spark = dataset.sparkSession import spark.implicits._ SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT) SchemaUtils.checkNumericType(dataset.schema, labelCol) val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)] .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) } val testResults = OldStatistics.chiSqTest(rdd) val pValues = Vectors.dense(testResults.map(_.pValue)) val degreesOfFreedom = testResults.map(_.degreesOfFreedom) val statistics = Vectors.dense(testResults.map(_.statistic)) spark.createDataFrame(Seq(ChiSquareResult(pValues, degreesOfFreedom, statistics))) } @Since("3.1.0") def testChiSquare( dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult] = { SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT) SchemaUtils.checkNumericType(dataset.schema, labelCol) val input = dataset.select(col(labelCol).cast(DoubleType), col(featuresCol)).rdd .map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val chiTestResult = OldStatistics.chiSqTest(input) chiTestResult.map(r => new ChiSqTestResult(r.pValue, r.degreesOfFreedom, r.statistic)) } {code} The newly added one is targeted to 3.1.0, so we can modify its return type without breaking api > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import 
org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> >val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 1 more field]scala> chi.show > +++--+ > | pValues|degreesOfFreedom|statistics| > +++--+ > |[0.68728927879097...| [2, 3]|[0.75,1.5]| > +++--+{code} > > Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, > {{Correlation}} all return a df containing only one row. > I think this is quite hard to use: suppose we have a dataset with dim=1000, > the only thing we can do with the test result is to collect it by > {{head()}} or {{first()}}, and then use it in the driver. > While what I really want to do is filter the df like {{pValue > 0.1}} or > {{corr < 0.5}}. *So I suggest flattening the output df in those tests.* > > note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but > ChiSquareTest and Correlation have been here for a long time.
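The proposed flattening — one row per feature instead of a single row of vectors — can be sketched in plain Python (the numbers and column names are illustrative, not Spark's actual output schema):

```python
# One-row-of-vectors result, as ChiSquareTest.test returns it today
# (illustrative values for a 2-feature dataset).
p_values = [0.68, 0.04]
degrees_of_freedom = [2, 3]
statistics = [0.75, 1.5]

# Flattened form: one row per feature.
rows = [
    {"featureIndex": i, "pValue": p, "degreesOfFreedom": d, "statistic": s}
    for i, (p, d, s) in enumerate(zip(p_values, degrees_of_freedom, statistics))
]

# Now "pValue > 0.1" is an ordinary row filter instead of a collect()
# followed by driver-side vector indexing.
selected = [r["featureIndex"] for r in rows if r["pValue"] > 0.1]
print(selected)  # [0]
```

With a flattened DataFrame the same filter stays distributed, which is exactly what matters for the dim=1000 case mentioned above.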