[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278148#comment-16278148 ] zhengruifeng commented on SPARK-19634: -- I think we can now use the new summarizer in the algs. > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > Fix For: 2.3.0 > > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278142#comment-16278142 ] Hyukjin Kwon commented on SPARK-22674: -- If that deduplication brings a performance regression, or is difficult to port, we should consider a separate fix as you did. Sure. Sorry, I overlooked your comments. > PySpark breaks serialization of namedtuple subclasses > - > > Key: SPARK-22674 > URL: https://issues.apache.org/jira/browse/SPARK-22674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Jonas Amrich > > PySpark monkey-patches the namedtuple class to make it serializable; however, > this breaks serialization of its subclasses. With the current implementation, any > subclass will be serialized (and deserialized) as its parent namedtuple. > Consider this code, which will fail with {{AttributeError: 'Point' object has > no attribute 'sum'}}: > {code} > from collections import namedtuple > Point = namedtuple("Point", "x y") > class PointSubclass(Point): > def sum(self): > return self.x + self.y > rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]]) > rdd.collect()[0][0].sum() > {code} > Moreover, as PySpark hijacks all namedtuples in the main module, importing > pyspark breaks serialization of namedtuple subclasses even in code which is > not related to Spark / distributed execution. I don't see any clean solution > to this; a possible workaround may be to limit the serialization hack to > direct namedtuple subclasses like in > https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204
[jira] [Assigned] (SPARK-22690) Imputer inherit HasOutputCols
[ https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22690: Assignee: (was: Apache Spark) > Imputer inherit HasOutputCols > - > > Key: SPARK-22690 > URL: https://issues.apache.org/jira/browse/SPARK-22690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Trivial > > trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also > inherit it.
[jira] [Assigned] (SPARK-22690) Imputer inherit HasOutputCols
[ https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22690: Assignee: Apache Spark > Imputer inherit HasOutputCols > - > > Key: SPARK-22690 > URL: https://issues.apache.org/jira/browse/SPARK-22690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > > trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also > inherit it.
[jira] [Commented] (SPARK-22690) Imputer inherit HasOutputCols
[ https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278128#comment-16278128 ] Apache Spark commented on SPARK-22690: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/19889 > Imputer inherit HasOutputCols > - > > Key: SPARK-22690 > URL: https://issues.apache.org/jira/browse/SPARK-22690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng >Priority: Trivial > > trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also > inherit it.
[jira] [Updated] (SPARK-22689) Could not resolve dependencies for project
[ https://issues.apache.org/jira/browse/SPARK-22689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puja Mudaliar updated SPARK-22689: -- Description: Hello team, Spark code compilation fails on a few machines whereas the same source code compiles on other machines. Please check the error on CentOS (4.10.12-1.el7.elrepo.x86_64): ./build/mvn -X -DskipTests -Dscala.lib.directory=/usr/share/scala -pl core compile [INFO] Building Spark Project Core 2.2.2-SNAPSHOT [INFO] [WARNING] The POM for org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-network-common_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-network-shuffle_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-unsafe_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:tests:2.2.2-SNAPSHOT is missing, no dependency information available [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 0.804 s [INFO] Finished at: 2017-12-04T23:18:58-08:00 [INFO] Final Memory: 43M/1963M [INFO] [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:2.2.2-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-network-common_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-unsafe_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-tags_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-tags_2.11:jar:tests:2.2.2-SNAPSHOT: Failure to find org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT in http://artifact.eng.stellus.in:8081/artifactory/libs-snapshot was cached in the local repository, resolution will not be reattempted until the update interval of snapshots has elapsed or updates are forced -> [Help 1] Note: The same source code compiles on another CentOS machine (3.10.0-514.el7.x86_64): ./build/mvn -DskipTests -Dscala.lib.directory=/usr/share/scala -pl core compile [INFO] --- maven-compiler-plugin:3.7.0:compile (default-compile) @ spark-core_2.11 --- [INFO] Not compiling main sources [INFO] [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-core_2.11 --- [INFO] Using zinc server for incremental compilation [info] Compile success at Dec 4, 2017 11:17:34 PM [0.331s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 5.663 s [INFO] Finished at: 2017-12-04T23:17:34-08:00 [INFO] Final Memory: 52M/1297M [INFO] was: Hello team, Spark code compilation fails on a few machines whereas the same source code compiles on other machines. The issue is not related to the kernel version, as I tried different kernel versions.
[jira] [Created] (SPARK-22689) Could not resolve dependencies for project
Puja Mudaliar created SPARK-22689: - Summary: Could not resolve dependencies for project Key: SPARK-22689 URL: https://issues.apache.org/jira/browse/SPARK-22689 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.0 Reporter: Puja Mudaliar Priority: Blocker Hello team, Spark code compilation fails on a few machines whereas the same source code compiles on other machines. The issue is not related to the kernel version, as I tried different kernel versions. Please check the error on CentOS (4.10.12-1.el7.elrepo.x86_64): ./build/mvn -X -DskipTests -Dscala.lib.directory=/usr/share/scala -pl core compile [INFO] Building Spark Project Core 2.2.2-SNAPSHOT [INFO] [WARNING] The POM for org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-network-common_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-network-shuffle_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-unsafe_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:2.2.2-SNAPSHOT is missing, no dependency information available [WARNING] The POM for org.apache.spark:spark-tags_2.11:jar:tests:2.2.2-SNAPSHOT is missing, no dependency information available [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 0.804 s [INFO] Finished at: 2017-12-04T23:18:58-08:00 [INFO] Final Memory: 43M/1963M [INFO] [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:2.2.2-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-network-common_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.11:jar:2.2.2-SNAPSHOT,
org.apache.spark:spark-unsafe_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-tags_2.11:jar:2.2.2-SNAPSHOT, org.apache.spark:spark-tags_2.11:jar:tests:2.2.2-SNAPSHOT: Failure to find org.apache.spark:spark-launcher_2.11:jar:2.2.2-SNAPSHOT in http://artifact.eng.stellus.in:8081/artifactory/libs-snapshot was cached in the local repository, resolution will not be reattempted until the update interval of snapshots has elapsed or updates are forced -> [Help 1] Note: The same source code passes on another CentOS machine(3.10.0-514.el7.x86_64) ./build/mvn -DskipTests -Dscala.lib.directory=/usr/share/scala -pl core compile [INFO] --- maven-compiler-plugin:3.7.0:compile (default-compile) @ spark-core_2.11 --- [INFO] Not compiling main sources [INFO] [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-core_2.11 --- [INFO] Using zinc server for incremental compilation [info] Compile success at Dec 4, 2017 11:17:34 PM [0.331s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 5.663 s [INFO] Finished at: 2017-12-04T23:17:34-08:00 [INFO] Final Memory: 52M/1297M [INFO] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22690) Imputer inherit HasOutputCols
zhengruifeng created SPARK-22690: Summary: Imputer inherit HasOutputCols Key: SPARK-22690 URL: https://issues.apache.org/jira/browse/SPARK-22690 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: zhengruifeng Priority: Trivial trait {{HasOutputCols}} was added in SPARK-20542; {{Imputer}} should also inherit it.
[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278121#comment-16278121 ] Hyukjin Kwon commented on SPARK-22674: -- Oh, sorry, I overlooked {{that regular pickle won't be able to unpickle namedtuples anymore}}. I didn't mean to completely remove support for regular pickle, but to deduplicate the serializer logic if possible and pin PySpark's copy to a specific version of cloudpickle, if possible. I'd like to avoid a separate fix within PySpark if we can. > PySpark breaks serialization of namedtuple subclasses > - > > Key: SPARK-22674 > URL: https://issues.apache.org/jira/browse/SPARK-22674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Jonas Amrich > > PySpark monkey-patches the namedtuple class to make it serializable; however, > this breaks serialization of its subclasses. With the current implementation, any > subclass will be serialized (and deserialized) as its parent namedtuple. > Consider this code, which will fail with {{AttributeError: 'Point' object has > no attribute 'sum'}}: > {code} > from collections import namedtuple > Point = namedtuple("Point", "x y") > class PointSubclass(Point): > def sum(self): > return self.x + self.y > rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]]) > rdd.collect()[0][0].sum() > {code} > Moreover, as PySpark hijacks all namedtuples in the main module, importing > pyspark breaks serialization of namedtuple subclasses even in code which is > not related to Spark / distributed execution.
I don't see any clean solution > to this; a possible workaround may be to limit the serialization hack to > direct namedtuple subclasses, as in > https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204
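The subclass demotion described above can be reproduced without Spark. The snippet below is a minimal, self-contained sketch that simulates the monkey patch; the {{_restore}} and {{_hijacked_reduce}} helpers are illustrative simplifications, not PySpark's actual code. Forcing instances to pickle as the parent namedtuple silently discards subclass behavior:

```python
from collections import namedtuple
import pickle

Point = namedtuple("Point", "x y")

# Illustrative stand-in for PySpark's monkey patch: reconstruct every
# instance as a freshly created *parent* namedtuple on unpickling.
def _restore(name, fields, values):
    return namedtuple(name, fields)(*values)

def _hijacked_reduce(self):
    return (_restore, (Point.__name__, Point._fields, tuple(self)))

Point.__reduce__ = _hijacked_reduce

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

p = pickle.loads(pickle.dumps(PointSubclass(1, 2)))
print(type(p).__name__)   # prints 'Point' -- the subclass identity is gone
print(hasattr(p, "sum"))  # prints 'False' -- exactly the AttributeError scenario
```

The workaround linked in the comment goes the other way: it narrows the hack so that only classes pickle cannot already handle get the special reduce, leaving subclasses to pickle normally.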
[jira] [Updated] (SPARK-22688) Upgrade Janino version 3.0.8
[ https://issues.apache.org/jira/browse/SPARK-22688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-22688: - Description: [Janino 3.0.8|https://janino-compiler.github.io/janino/changelog.html] includes an important fix to reduce the number of constant pool entries by using {{sipush}} java bytecode. * SIPUSH bytecode is not used for short integer constant [#33|https://github.com/janino-compiler/janino/issues/33] was: [Janino 0.3.8|https://janino-compiler.github.io/janino/changelog.html] includes an important fix to reduce the number of constant pool entries by using {{sipush}} java bytecode. * SIPUSH bytecode is not used for short integer constant [#33|https://github.com/janino-compiler/janino/issues/33] > Upgrade Janino version 3.0.8 > > > Key: SPARK-22688 > URL: https://issues.apache.org/jira/browse/SPARK-22688 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > [Janino 3.0.8|https://janino-compiler.github.io/janino/changelog.html] > includes an important fix to reduce the number of constant pool entries by > using {{sipush}} java bytecode. > * SIPUSH bytecode is not used for short integer constant > [#33|https://github.com/janino-compiler/janino/issues/33] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22688) Upgrade Janino version 3.0.8
[ https://issues.apache.org/jira/browse/SPARK-22688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-22688: - Summary: Upgrade Janino version 3.0.8 (was: Upgrade Janino version 0.3.8) > Upgrade Janino version 3.0.8 > > > Key: SPARK-22688 > URL: https://issues.apache.org/jira/browse/SPARK-22688 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > [Janino 0.3.8|https://janino-compiler.github.io/janino/changelog.html] > includes an important fix to reduce the number of constant pool entries by > using {{sipush}} java bytecode. > * SIPUSH bytecode is not used for short integer constant > [#33|https://github.com/janino-compiler/janino/issues/33] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22688) Upgrade Janino version 0.3.8
Kazuaki Ishizaki created SPARK-22688: Summary: Upgrade Janino version 0.3.8 Key: SPARK-22688 URL: https://issues.apache.org/jira/browse/SPARK-22688 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki [Janino 0.3.8|https://janino-compiler.github.io/janino/changelog.html] includes an important fix to reduce the number of constant pool entries by using {{sipush}} java bytecode. * SIPUSH bytecode is not used for short integer constant [#33|https://github.com/janino-compiler/janino/issues/33] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22660) Compile with scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278064#comment-16278064 ] liyunzhang commented on SPARK-22660: OK, I created SPARK-22687 to record the runtime problem. {quote}But here you are already Hadoop 2 won't work with Java 9.{quote} Sorry for not describing it clearly: the Hadoop here is hadoop-3.0.0, which is enabled for JDK9 (HADOOP-14984, HADOOP-14978). > Compile with scala-2.12 and JDK9 > > > Key: SPARK-22660 > URL: https://issues.apache.org/jira/browse/SPARK-22660 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: liyunzhang >Priority: Minor > > build with scala-2.12 with the following steps > 1. change the pom.xml with scala-2.12 > ./dev/change-scala-version.sh 2.12 > 2. build with -Pscala-2.12 > for hive on spark > {code} > ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn > -Pparquet-provided -Dhadoop.version=2.7.3 > {code} > for spark sql > {code} > ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Phive > -Dhadoop.version=2.7.3>log.sparksql 2>&1 > {code} > get the following errors > #Error1 > {code} > /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: > error: cannot find symbol > Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory)); > {code} > This is because sun.misc.Cleaner has been moved to a new location in JDK9. > HADOOP-12760 will be the long-term fix > #Error2 > {code} > spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: > ambiguous reference to overloaded definition, method limit in class > ByteBuffer of type (x$1: Int)java.nio.ByteBuffer > method limit in class Buffer of type ()Int > match expected type ? > val resultSize = serializedDirectResult.limit > error > {code} > The limit method was moved from ByteBuffer to the superclass Buffer and it > can no longer be called without (). The same applies to the position method.
> #Error3 > {code} > home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: > ambiguous reference to overloaded definition, [error] both method putAll in > class Properties of type (x$1: java.util.Map[_, _])Unit [error] and method > putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: > Object])Unit [error] match argument types (java.util.Map[String,String]) > [error] properties.putAll(propsMap.asJava) > [error]^ > [error] > /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: > ambiguous reference to overloaded definition, [error] both method putAll in > class Properties of type (x$1: java.util.Map[_, _])Unit [error] and method > putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: > Object])Unit [error] match argument types (java.util.Map[String,String]) > [error] props.putAll(outputSerdeProps.toMap.asJava) > [error] ^ > {code} > This is because the key type is Object instead of String which is unsafe. > After solving these 3 errors, compile successfully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22687) Run spark-sql in scala-2.12 and JDK9
liyunzhang created SPARK-22687: -- Summary: Run spark-sql in scala-2.12 and JDK9 Key: SPARK-22687 URL: https://issues.apache.org/jira/browse/SPARK-22687 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.0 Reporter: liyunzhang Based on SPARK-22660, running spark sql in scala-2.12 and JDK9 env. Here the hadoop used is enabled by JDK9(See HADOOP-14984, HADOOP-14978) Here exception is {code} [root@bdpe41 spark-2.3.0-SNAPSHOT-bin-2.7.3]# ./bin/spark-shell spark-2.3.0-SNAPSHOT-bin-2.7. ^C[root@bdpe41 spark-2.3.0-SNAPSHOT-bin-2.7.3]# ./bin/spark-shell --driver-memory 1G WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/zly/spark-2.3.0-SNAPSHOT-bin-2.7.3/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance() WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release 2017-12-05 03:03:23,511 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 9.0.1) Type in expressions to have them evaluated. Type :help for more information. scala> Spark context Web UI available at http://bdpe41:4040 Spark context available as 'sc' (master = local[*], app id = local-1512414208378). Spark session available as 'spark'. 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) warning: there was one deprecation warning (since 2.0.0); for details, enable `:setting -deprecation' or `:replay -deprecation' sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@8da0e54 scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> case class Customer(customer_id: Int, name: String, city: String, state: String, zip_code: String) defined class Customer scala> val dfCustomers = sc.textFile("/home/zly/spark-2.3.0-SNAPSHOT-bin-2.7.3/customers.txt").map(_.split(",")).map(p => Customer(p(0).trim.toInt, p(1), p(2), p(3), p(4))).toDF() 2017-12-05 03:04:02,647 WARN util.ClosureCleaner: Expected a closure; got org.apache.spark.SparkContext$$Lambda$2237/371823738 2017-12-05 03:04:02,649 WARN util.ClosureCleaner: Expected a closure; got org.apache.spark.SparkContext$$Lambda$2242/539107678 2017-12-05 03:04:02,651 WARN util.ClosureCleaner: Expected a closure; got $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$2245/345086812 2017-12-05 03:04:02,654 WARN util.ClosureCleaner: Expected a closure; got $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$2246/1829622584 2017-12-05 03:04:03,861 WARN metadata.Hive: Failed to access metastore. This class should not accessed in runtime. 
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:114) at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:383) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at
[jira] [Commented] (SPARK-8971) Support balanced class labels when splitting train/cross validation sets
[ https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278060#comment-16278060 ] Ashish Chopra commented on SPARK-8971: -- When can we expect this in the DataFrame API? > Support balanced class labels when splitting train/cross validation sets > > > Key: SPARK-8971 > URL: https://issues.apache.org/jira/browse/SPARK-8971 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Feynman Liang >Assignee: Seth Hendrickson > > {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are > Spark classes which partition data into training and evaluation sets for > performing hyperparameter selection via cross validation. > Both methods currently perform the split by randomly sampling the datasets. > However, when class probabilities are highly imbalanced (e.g. detection of > extremely low-frequency events), random sampling may result in cross > validation sets not representative of actual out-of-training performance > (e.g. no positive training examples could be included). > Mainstream R packages like > [caret|http://topepo.github.io/caret/splitting.html] already support splitting > the data based upon the class labels.
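For intuition, the caret-style split described in the issue can be sketched in a few lines of plain Python; this is an illustrative helper, not the proposed Spark API. The idea is to sample the test fraction within each class, so every label keeps its proportion in both sets:

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_frac=0.25, seed=42):
    """Split rows into (train, test), preserving each class's proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        k = int(round(len(group) * test_frac))  # per-class test count
        test.extend(group[:k])
        train.extend(group[k:])
    return train, test

# With 90 negatives and 10 positives, a plain random 25% sample can miss the
# positives entirely; the stratified split always keeps some in both sets.
rows = [("neg", 0)] * 90 + [("pos", 1)] * 10
train, test = stratified_split(rows, label_of=lambda r: r[1])
```

A distributed version would do the same per-class sampling with something like `sampleByKey`, but the invariant is the one above: each label's frequency survives the split.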
[jira] [Assigned] (SPARK-22686) DROP TABLE IF EXISTS should not throw AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22686: Assignee: Apache Spark > DROP TABLE IF EXISTS should not throw AnalysisException > --- > > Key: SPARK-22686 > URL: https://issues.apache.org/jira/browse/SPARK-22686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > During SPARK-22488 to Fix the view resolution issue, there occurs a > regression at 2.2.1 and master branch like the following. > {code} > scala> spark.version > res2: String = 2.2.1 > scala> sql("DROP TABLE IF EXISTS t").show > 17/12/04 21:01:06 WARN DropTableCommand: > org.apache.spark.sql.AnalysisException: Table or view not found: t; > org.apache.spark.sql.AnalysisException: Table or view not found: t; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22686) DROP TABLE IF EXISTS should not throw AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22686: Assignee: (was: Apache Spark) > DROP TABLE IF EXISTS should not throw AnalysisException > --- > > Key: SPARK-22686 > URL: https://issues.apache.org/jira/browse/SPARK-22686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun > > During SPARK-22488 to Fix the view resolution issue, there occurs a > regression at 2.2.1 and master branch like the following. > {code} > scala> spark.version > res2: String = 2.2.1 > scala> sql("DROP TABLE IF EXISTS t").show > 17/12/04 21:01:06 WARN DropTableCommand: > org.apache.spark.sql.AnalysisException: Table or view not found: t; > org.apache.spark.sql.AnalysisException: Table or view not found: t; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22686) DROP TABLE IF EXISTS should not throw AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278022#comment-16278022 ] Apache Spark commented on SPARK-22686: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19888 > DROP TABLE IF EXISTS should not throw AnalysisException > --- > > Key: SPARK-22686 > URL: https://issues.apache.org/jira/browse/SPARK-22686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun > > During SPARK-22488, which fixed the view resolution issue, a regression was > introduced in 2.2.1 and the master branch, as shown below. > {code} > scala> spark.version > res2: String = 2.2.1 > scala> sql("DROP TABLE IF EXISTS t").show > 17/12/04 21:01:06 WARN DropTableCommand: > org.apache.spark.sql.AnalysisException: Table or view not found: t; > org.apache.spark.sql.AnalysisException: Table or view not found: t; > {code}
[jira] [Updated] (SPARK-22686) DROP TABLE IF EXISTS should not throw AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-22686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22686: -- Summary: DROP TABLE IF EXISTS should not throw AnalysisException (was: DROP TABLE IF NOT EXISTS should not throw AnalysisException) > DROP TABLE IF EXISTS should not throw AnalysisException > --- > > Key: SPARK-22686 > URL: https://issues.apache.org/jira/browse/SPARK-22686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun > > While fixing the view resolution issue in SPARK-22488, a > regression was introduced in the 2.2.1 and master branches, as shown below. > {code} > scala> spark.version > res2: String = 2.2.1 > scala> sql("DROP TABLE IF EXISTS t").show > 17/12/04 21:01:06 WARN DropTableCommand: > org.apache.spark.sql.AnalysisException: Table or view not found: t; > org.apache.spark.sql.AnalysisException: Table or view not found: t; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22686) DROP TABLE IF NOT EXISTS should not throw AnalysisException
Dongjoon Hyun created SPARK-22686: - Summary: DROP TABLE IF NOT EXISTS should not throw AnalysisException Key: SPARK-22686 URL: https://issues.apache.org/jira/browse/SPARK-22686 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Dongjoon Hyun While fixing the view resolution issue in SPARK-22488, a regression was introduced in the 2.2.1 and master branches, as shown below. {code} scala> spark.version res2: String = 2.2.1 scala> sql("DROP TABLE IF EXISTS t").show 17/12/04 21:01:06 WARN DropTableCommand: org.apache.spark.sql.AnalysisException: Table or view not found: t; org.apache.spark.sql.AnalysisException: Table or view not found: t; {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
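For reference, the IF EXISTS contract the summary asks for can be sketched generically. This is a plain-Python illustration with hypothetical names (`Catalog`, `TableNotFound`), not Spark's actual DropTableCommand: dropping a missing table must be a silent no-op when IF EXISTS is given, and an error only when it is not.

```python
# Illustrative sketch of DROP TABLE IF EXISTS semantics (hypothetical
# Catalog class, not Spark internals).
class TableNotFound(Exception):
    pass

class Catalog:
    def __init__(self):
        self.tables = {"existing"}

    def drop_table(self, name, if_exists=False):
        if name not in self.tables:
            if if_exists:
                return  # silent no-op: nothing raised, nothing logged
            raise TableNotFound(f"Table or view not found: {name};")
        self.tables.remove(name)

cat = Catalog()
cat.drop_table("t", if_exists=True)  # "t" is absent, but no error is raised
cat.drop_table("existing")           # a normal drop still works
```

The regression above is that the IF EXISTS path still surfaced the AnalysisException (as a WARN) instead of taking the silent branch.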
[jira] [Resolved] (SPARK-22682) HashExpression does not need to create global variables
[ https://issues.apache.org/jira/browse/SPARK-22682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22682. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19878 [https://github.com/apache/spark/pull/19878] > HashExpression does not need to create global variables > --- > > Key: SPARK-22682 > URL: https://issues.apache.org/jira/browse/SPARK-22682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22677) cleanup whole stage codegen for hash aggregate
[ https://issues.apache.org/jira/browse/SPARK-22677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22677. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19869 [https://github.com/apache/spark/pull/19869] > cleanup whole stage codegen for hash aggregate > -- > > Key: SPARK-22677 > URL: https://issues.apache.org/jira/browse/SPARK-22677 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22365) Spark UI executors empty list with 500 error
[ https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276618#comment-16276618 ] bruce xu edited comment on SPARK-22365 at 12/5/17 4:31 AM: --- Hi [~dubovsky]. Glad to have your response. I hit this issue using Spark ThriftServer as a JDBC service, and the Spark version is 2.2.1-rc1. I will also try to find the reason; maybe it's a bug anyway. UPDATE: [~dubovsky] I solved the problem by deleting jsr311-api-1.1.1.jar from $SPARK_HOME/jars. The reason can be found in [NoSuchMethodError on startup in Java Jersey app|https://stackoverflow.com/questions/28509370/nosuchmethoderror-on-startup-in-java-jersey-app]. [~sowen] Deleting jsr311-api-1.1.1.jar solves the problem, but I wonder if this is the root cause. was (Author: xwc3504): Hi [~dubovsky]. Glad to have your response. I hit this issue using Spark ThriftServer as a JDBC service, and the Spark version is 2.2.1-rc1. I will also try to find the reason; maybe it's a bug anyway. UPDATE: [~dubovsky] I solved the problem by deleting jsr311-api-1.1.1.jar from $SPARK_HOME/jars. The reason can be found in [NoSuchMethodError on startup in Java Jersey app|https://stackoverflow.com/questions/28509370/nosuchmethoderror-on-startup-in-java-jersey-app] > Spark UI executors empty list with 500 error > > > Key: SPARK-22365 > URL: https://issues.apache.org/jira/browse/SPARK-22365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Jakub Dubovsky > Attachments: spark-executor-500error.png > > > No data loaded on the "executors" tab in the Spark UI, with the stack trace below. Apart > from the exception I have nothing more. But if I can test something to make this > easier to resolve I am happy to help. 
> {code} > java.lang.NullPointerException > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18801) Support resolve a nested view
[ https://issues.apache.org/jira/browse/SPARK-18801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18801: -- Fix Version/s: 2.2.0 > Support resolve a nested view > - > > Key: SPARK-18801 > URL: https://issues.apache.org/jira/browse/SPARK-18801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo > Fix For: 2.2.0 > > > We should be able to resolve a nested view. The main advantage is that if you > update an underlying view, the current view also gets updated. > The new approach should be compatible with older versions of SPARK/HIVE, that > means: > 1. The new approach should be able to resolve the views that created by > older versions of SPARK/HIVE; > 2. The new approach should be able to resolve the views that are > currently supported by SPARK SQL. > The new approach mainly brings in the following changes: > 1. Add a new operator called `View` to keep track of the CatalogTable > that describes the view, and the output attributes as well as the child of > the view; > 2. Update the `ResolveRelations` rule to resolve the relations and > views, note that a nested view should be resolved correctly; > 3. Add `AnalysisContext` to enable us to still support a view created > with CTE/Windows query. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21168) KafkaRDD should always set kafka clientId.
[ https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277939#comment-16277939 ] Apache Spark commented on SPARK-21168: -- User 'liu-zhaokun' has created a pull request for this issue: https://github.com/apache/spark/pull/19887 > KafkaRDD should always set kafka clientId. > -- > > Key: SPARK-21168 > URL: https://issues.apache.org/jira/browse/SPARK-21168 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Xingxing Di >Priority: Trivial > > I found that KafkaRDD does not set the Kafka client.id in the "fetchBatch" method > (FetchRequestBuilder sets clientId to empty by default). Normally this > affects nothing, but in our case we use the clientId on the Kafka server side, > so we have to rebuild spark-streaming-kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
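The fix direction can be sketched with a hypothetical builder in plain Python (this mirrors, but is not, Kafka's actual FetchRequestBuilder API): the caller's clientId is always propagated instead of leaving the builder's empty default in place.

```python
# Hypothetical builder sketch: a caller-supplied clientId overrides the
# empty default the ticket complains about.
class FetchRequest:
    def __init__(self, client_id):
        self.client_id = client_id

class FetchRequestBuilder:
    def __init__(self):
        self._client_id = ""  # empty by default, as the ticket notes

    def client_id(self, cid):
        self._client_id = cid
        return self

    def build(self):
        return FetchRequest(self._client_id)

# A fetchBatch-style caller would always set the id before building:
req = FetchRequestBuilder().client_id("spark-executor-myapp").build()
```

With this pattern, server-side tooling that keys on clientId sees a meaningful value rather than the empty string.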
[jira] [Created] (SPARK-22685) Spark Streaming using Kinesis doesn't work if shard checkpoints exist in DynamoDB
Grega Kespret created SPARK-22685: - Summary: Spark Streaming using Kinesis doesn't work if shard checkpoints exist in DynamoDB Key: SPARK-22685 URL: https://issues.apache.org/jira/browse/SPARK-22685 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.2.0 Reporter: Grega Kespret Apologize if this is not the best place to post this / if the description is lacking some needed info. Please let me know and I will update. This was cross-posted on [StackOverflow|https://stackoverflow.com/questions/47644984/spark-streaming-using-kinesis-doesnt-work-if-shard-checkpoints-exist-in-dynamod]. **TL;DR** – If shard checkpoints don't exist in DynamoDB (== completely fresh), Spark Streaming application reading from Kinesis works flawlessly. However, if the checkpoints exist (e.g. due to app restart), it fails most of the times. The app uses **Spark Streaming 2.2.0** and **spark-streaming-kinesis-asl_2.11**. When starting the app with checkpointed shard data (written by KCL to DynamoDB), after a few successful batches (number varies), this is what I can see in the logs: First, **Leases are lost**: {code} 17/12/01 05:16:50 INFO LeaseRenewer: Worker 10.0.182.119:9781acd5-6cb3-4a39-a235-46f1254eb885 lost lease with key shardId-0515 {code} Then in random order: **Can't update checkpoint - instance doesn't hold the lease for this shard** and **com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond** follow, bringing down the whole app in a few batches: {code} 17/12/01 05:17:10 ERROR ProcessTask: ShardId shardId-0394: Caught exception: com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1069) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1035) at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667) at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:1948) at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:1924) at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:969) at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:945) at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.get(KinesisProxy.java:156) at com.amazonaws.services.kinesis.clientlibrary.proxies.MetricsCollectingKinesisProxyDecorator.get(MetricsCollectingKinesisProxyDecorator.java:74) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher.getRecords(KinesisDataFetcher.java:68) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResultAndRecordMillisBehindLatest(ProcessTask.java:291) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResult(ProcessTask.java:256) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:127) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49) at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) at
[jira] [Commented] (SPARK-21168) KafkaRDD should always set kafka clientId.
[ https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277875#comment-16277875 ] liuzhaokun commented on SPARK-21168: [~dixingx...@yeah.net] Hi, as your PR is not in progress, can I create a new PR to fix this problem? > KafkaRDD should always set kafka clientId. > -- > > Key: SPARK-21168 > URL: https://issues.apache.org/jira/browse/SPARK-21168 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Xingxing Di >Priority: Trivial > > I found that KafkaRDD does not set the Kafka client.id in the "fetchBatch" method > (FetchRequestBuilder sets clientId to empty by default). Normally this > affects nothing, but in our case we use the clientId on the Kafka server side, > so we have to rebuild spark-streaming-kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22656) Upgrade Arrow to 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22656. -- Resolution: Duplicate > Upgrade Arrow to 0.8.0 > -- > > Key: SPARK-22656 > URL: https://issues.apache.org/jira/browse/SPARK-22656 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > Arrow 0.8.0 will upgrade Netty to 4.1.x and unblock SPARK-19552 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22665) Dataset API: .repartition() inconsistency / issue
[ https://issues.apache.org/jira/browse/SPARK-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-22665: --- Assignee: Marco Gaido > Dataset API: .repartition() inconsistency / issue > - > > Key: SPARK-22665 > URL: https://issues.apache.org/jira/browse/SPARK-22665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrian Ionescu >Assignee: Marco Gaido > Fix For: 2.3.0 > > > We currently have two functions for explicitly repartitioning a Dataset: > {code} > def repartition(numPartitions: Int) > {code} > and > {code} > def repartition(numPartitions: Int, partitionExprs: Column*) > {code} > The second function's signature allows it to be called with an empty list of > expressions as well. > However: > * {{df.repartition(numPartitions)}} does RoundRobin partitioning > * {{df.repartition(numPartitions, Seq.empty: _*)}} does HashPartitioning on a > constant, effectively moving all tuples to a single partition > Not only is this inconsistent, but the latter behavior is very undesirable: > it may hide problems in small-scale prototype code, but will inevitably fail > (or have terrible performance) in production. > I suggest we should make it: > - either throw an {{IllegalArgumentException}} > - or do RoundRobin partitioning, just like {{df.repartition(numPartitions)}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22665) Dataset API: .repartition() inconsistency / issue
[ https://issues.apache.org/jira/browse/SPARK-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22665. - Resolution: Fixed Fix Version/s: 2.3.0 > Dataset API: .repartition() inconsistency / issue > - > > Key: SPARK-22665 > URL: https://issues.apache.org/jira/browse/SPARK-22665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrian Ionescu >Assignee: Marco Gaido > Fix For: 2.3.0 > > > We currently have two functions for explicitly repartitioning a Dataset: > {code} > def repartition(numPartitions: Int) > {code} > and > {code} > def repartition(numPartitions: Int, partitionExprs: Column*) > {code} > The second function's signature allows it to be called with an empty list of > expressions as well. > However: > * {{df.repartition(numPartitions)}} does RoundRobin partitioning > * {{df.repartition(numPartitions, Seq.empty: _*)}} does HashPartitioning on a > constant, effectively moving all tuples to a single partition > Not only is this inconsistent, but the latter behavior is very undesirable: > it may hide problems in small-scale prototype code, but will inevitably fail > (or have terrible performance) in production. > I suggest we should make it: > - either throw an {{IllegalArgumentException}} > - or do RoundRobin partitioning, just like {{df.repartition(numPartitions)}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
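The inconsistency described in the ticket can be illustrated outside Spark. The functions below are plain-Python stand-ins (not Spark code) for the two code paths: round-robin partitioning spreads rows evenly, while hash-partitioning on a constant sends every row to the same partition.

```python
# Sketch of the two partitioning behaviors the ticket contrasts.
def round_robin(rows, n):
    # df.repartition(numPartitions): row i goes to partition i % n
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_on_constant(rows, n, const=0):
    # df.repartition(n, <no exprs>): hashing a constant picks the same
    # bucket for every row, collapsing the data onto one partition
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(const) % n].append(row)
    return parts

rows = list(range(8))
balanced = round_robin(rows, 4)     # 2 rows in each of the 4 partitions
skewed = hash_on_constant(rows, 4)  # all 8 rows land in a single partition
```

The `skewed` result is why the ticket calls the empty-expression behavior undesirable: it silently serializes all work onto one partition.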
[jira] [Commented] (SPARK-22162) Executors and the driver use inconsistent Job IDs during the new RDD commit protocol
[ https://issues.apache.org/jira/browse/SPARK-22162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277795#comment-16277795 ] Apache Spark commented on SPARK-22162: -- User 'rezasafi' has created a pull request for this issue: https://github.com/apache/spark/pull/19886 > Executors and the driver use inconsistent Job IDs during the new RDD commit > protocol > > > Key: SPARK-22162 > URL: https://issues.apache.org/jira/browse/SPARK-22162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi >Assignee: Reza Safi > Fix For: 2.3.0 > > > After the SPARK-18191 commit in pull request 15769, using the new commit protocol > it is possible that the driver and executors use different jobIds during an RDD > commit. > In the old code, the variable stageId is part of the closure used to define > the task, as you can see here: > > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1098] > As a result, a TaskAttemptId is constructed in executors using the same > "stageId" as the driver, since it is a value that is serialized in the > driver. Also, the value of stageID is actually the rdd.id, which is assigned > here: > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1084] > However, after the change in pull request 15769, the value is no longer part > of the task closure, which gets serialized by the driver. Instead, it is > pulled from the taskContext, as you can see > here: [https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R103] > and then that value is used to construct the TaskAttemptId on the executors: > [https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R134] > taskContext has a stageID value, which is set in the DAGScheduler. So after > the change, unlike the old code where rdd.id was used, an actual stage.id is > used, which can differ between executors and the driver since it is no longer > serialized. > In summary, the old code consistently used rddId and just incorrectly named > it "stageId". > The new code uses a mix of rddId and stageId. There should be a consistent ID > between executors and the driver. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
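The core mechanism in the description can be illustrated in plain Python (this is a sketch, not Spark's serialization machinery): a value captured in the task closure on the driver is identical on every executor, whereas a value each executor pulls from its own task context can diverge.

```python
# Sketch: closure-captured id vs. context-derived id.
rdd_id = 42  # assigned once on the "driver"

def make_task(captured_id):
    def task(task_context):
        # captured_id travels inside the serialized closure -> consistent
        # everywhere; task_context["stage_id"] is looked up locally on
        # each executor -> may differ between driver and executors
        return captured_id, task_context["stage_id"]
    return task

task = make_task(rdd_id)
on_executor_1 = task({"stage_id": 7})
on_executor_2 = task({"stage_id": 9})
# The captured id agrees on both; the context-derived id does not.
```

This is exactly the inconsistency the ticket describes: the old code shipped the id inside the closure, the new code reads it from per-task context.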
[jira] [Commented] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277787#comment-16277787 ] Apache Spark commented on SPARK-22587: -- User 'merlintang' has created a pull request for this issue: https://github.com/apache/spark/pull/19885 > Spark job fails if fs.defaultFS and application jar are different url > - > > Key: SPARK-22587 > URL: https://issues.apache.org/jira/browse/SPARK-22587 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph > > Spark Job fails if the fs.defaultFs and url where application jar resides are > different and having same scheme, > spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py > core-site.xml fs.defaultFS is set to wasb:///YYY. Hadoop list works (hadoop > fs -ls) works for both the url XXX and YYY. > {code} > Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: > wasb://XXX/tmp/test.py, expected: wasb://YYY > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251) > > at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507) > > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912) > > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751) > > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > The code Client.copyFileToRemote tries to resolve the path of application jar > (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead > of the actual url of application jar. > val destFs = destDir.getFileSystem(hadoopConf) > val srcFs = srcPath.getFileSystem(hadoopConf) > getFileSystem will create the filesystem based on the url of the path and so > this is fine. But the below lines of code tries to get the srcPath (XXX url) > from the destFs (YYY url) and so it fails. > var destPath = srcPath > val qualifiedDestPath = destFs.makeQualified(destPath) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
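The failure mode above, qualifying a source path against a FileSystem built from fs.defaultFS, comes down to comparing URI schemes and authorities. Here is a minimal Python sketch of that comparison; `same_filesystem` is a hypothetical helper, not Spark's Scala `compareFs`:

```python
from urllib.parse import urlparse

def same_filesystem(src_uri, default_fs_uri):
    """Two URIs refer to the same filesystem only when both the scheme
    and the authority (host part) match, case-insensitively. A scheme
    match alone (wasb vs wasb) is not enough."""
    src, dst = urlparse(src_uri), urlparse(default_fs_uri)
    if (src.scheme or "").lower() != (dst.scheme or "").lower():
        return False
    return (src.netloc or "").lower() == (dst.netloc or "").lower()

# Same scheme (wasb) but different authorities -> different filesystems,
# so the jar must not be qualified against the fs.defaultFS filesystem.
print(same_filesystem("wasb://XXX/tmp/test.py", "wasb://YYY"))  # False
print(same_filesystem("wasb://YYY/tmp/app.jar", "wasb://YYY"))  # True
```

Under this check, the XXX jar would be treated as remote to the YYY default filesystem and copied, instead of triggering the "Wrong FS" error.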
[jira] [Comment Edited] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273745#comment-16273745 ] Mingjie Tang edited comment on SPARK-22587 at 12/5/17 12:01 AM: we can update the compareFS by considering the authority. https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1442 The PR is sent out. https://github.com/apache/spark/pull/19885 was (Author: merlin): we can update the compareFS by considering the authority. https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1442 I would send out a PR soon. > Spark job fails if fs.defaultFS and application jar are different url > - > > Key: SPARK-22587 > URL: https://issues.apache.org/jira/browse/SPARK-22587 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph > > Spark Job fails if the fs.defaultFs and url where application jar resides are > different and having same scheme, > spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py > core-site.xml fs.defaultFS is set to wasb:///YYY. Hadoop list works (hadoop > fs -ls) works for both the url XXX and YYY. 
> {code} > Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: > wasb://XXX/tmp/test.py, expected: wasb://YYY > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251) > > at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507) > > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912) > > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751) > > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > The code Client.copyFileToRemote tries to resolve the path of application jar > (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead > of the actual url of application jar. 
> val destFs = destDir.getFileSystem(hadoopConf) > val srcFs = srcPath.getFileSystem(hadoopConf) > getFileSystem will create the filesystem based on the url of the path, and so > this is fine. But the lines of code below try to resolve the srcPath (XXX url) > against the destFs (YYY url), and so it fails. > var destPath = srcPath > val qualifiedDestPath = destFs.makeQualified(destPath) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22587: Assignee: (was: Apache Spark) > Spark job fails if fs.defaultFS and application jar are different url > - > > Key: SPARK-22587 > URL: https://issues.apache.org/jira/browse/SPARK-22587 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph > > Spark Job fails if the fs.defaultFs and url where application jar resides are > different and having same scheme, > spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py > core-site.xml fs.defaultFS is set to wasb:///YYY. Hadoop list works (hadoop > fs -ls) works for both the url XXX and YYY. > {code} > Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: > wasb://XXX/tmp/test.py, expected: wasb://YYY > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251) > > at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507) > > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912) > > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751) > > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > The code Client.copyFileToRemote tries to resolve the path of application jar > (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead > of the actual url of application jar. > val destFs = destDir.getFileSystem(hadoopConf) > val srcFs = srcPath.getFileSystem(hadoopConf) > getFileSystem will create the filesystem based on the url of the path and so > this is fine. But the below lines of code tries to get the srcPath (XXX url) > from the destFs (YYY url) and so it fails. > var destPath = srcPath > val qualifiedDestPath = destFs.makeQualified(destPath) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22587: Assignee: Apache Spark > Spark job fails if fs.defaultFS and application jar are different url > - > > Key: SPARK-22587 > URL: https://issues.apache.org/jira/browse/SPARK-22587 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Assignee: Apache Spark > > Spark Job fails if the fs.defaultFs and url where application jar resides are > different and having same scheme, > spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py > core-site.xml fs.defaultFS is set to wasb:///YYY. Hadoop list works (hadoop > fs -ls) works for both the url XXX and YYY. > {code} > Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: > wasb://XXX/tmp/test.py, expected: wasb://YYY > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251) > > at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507) > > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912) > > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751) > > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > The code Client.copyFileToRemote tries to resolve the path of application jar > (XXX) from the FileSystem object created using fs.defaultFS url (YYY) instead > of the actual url of application jar. > val destFs = destDir.getFileSystem(hadoopConf) > val srcFs = srcPath.getFileSystem(hadoopConf) > getFileSystem will create the filesystem based on the url of the path and so > this is fine. But the below lines of code tries to get the srcPath (XXX url) > from the destFs (YYY url) and so it fails. > var destPath = srcPath > val qualifiedDestPath = destFs.makeQualified(destPath) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22324) Upgrade Arrow to version 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22324: Assignee: Apache Spark > Upgrade Arrow to version 0.8.0 > -- > > Key: SPARK-22324 > URL: https://issues.apache.org/jira/browse/SPARK-22324 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Apache Spark > > Arrow version 0.8.0 is slated for release in early November, but I'd like to > start discussing to help get all the work that's being done synced up. > Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test > envs will need to be upgraded as well that will take a fair amount of work > and planning. > One topic I'd like to discuss is if pyarrow should be an installation > requirement for pyspark, i.e. when a user pip installs pyspark, it will also > install pyarrow. If not, then is there a minimum version that needs to be > supported? We currently have 0.4.1 installed on Jenkins. > There are a number of improvements and cleanups in the current code that can > happen depending on what we decide (I'll link them all here later, but off > the top of my head): > * Decimal bug fix and improved support > * Improved internal casting between pyarrow and pandas (can clean up some > workarounds), this will also verify data bounds if the user specifies a type > and data overflows. 
see > https://github.com/apache/spark/pull/19459#discussion_r146421804 > * Better type checking when converting Spark types to Arrow > * Timestamp conversion to microseconds (for Spark internal format) > * Full support for using validity mask with 'object' types > https://github.com/apache/spark/pull/18664#discussion_r146567335 > * VectorSchemaRoot can call close more than once to simplify listener > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22324) Upgrade Arrow to version 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277755#comment-16277755 ] Apache Spark commented on SPARK-22324: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/19884 > Upgrade Arrow to version 0.8.0 > -- > > Key: SPARK-22324 > URL: https://issues.apache.org/jira/browse/SPARK-22324 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > Arrow version 0.8.0 is slated for release in early November, but I'd like to > start discussing to help get all the work that's being done synced up. > Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test > envs will need to be upgraded as well that will take a fair amount of work > and planning. > One topic I'd like to discuss is if pyarrow should be an installation > requirement for pyspark, i.e. when a user pip installs pyspark, it will also > install pyarrow. If not, then is there a minimum version that needs to be > supported? We currently have 0.4.1 installed on Jenkins. > There are a number of improvements and cleanups in the current code that can > happen depending on what we decide (I'll link them all here later, but off > the top of my head): > * Decimal bug fix and improved support > * Improved internal casting between pyarrow and pandas (can clean up some > workarounds), this will also verify data bounds if the user specifies a type > and data overflows. 
see > https://github.com/apache/spark/pull/19459#discussion_r146421804 > * Better type checking when converting Spark types to Arrow > * Timestamp conversion to microseconds (for Spark internal format) > * Full support for using validity mask with 'object' types > https://github.com/apache/spark/pull/18664#discussion_r146567335 > * VectorSchemaRoot can call close more than once to simplify listener > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22324) Upgrade Arrow to version 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22324: Assignee: (was: Apache Spark) > Upgrade Arrow to version 0.8.0 > -- > > Key: SPARK-22324 > URL: https://issues.apache.org/jira/browse/SPARK-22324 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > Arrow version 0.8.0 is slated for release in early November, but I'd like to > start discussing to help get all the work that's being done synced up. > Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test > envs will need to be upgraded as well that will take a fair amount of work > and planning. > One topic I'd like to discuss is if pyarrow should be an installation > requirement for pyspark, i.e. when a user pip installs pyspark, it will also > install pyarrow. If not, then is there a minimum version that needs to be > supported? We currently have 0.4.1 installed on Jenkins. > There are a number of improvements and cleanups in the current code that can > happen depending on what we decide (I'll link them all here later, but off > the top of my head): > * Decimal bug fix and improved support > * Improved internal casting between pyarrow and pandas (can clean up some > workarounds), this will also verify data bounds if the user specifies a type > and data overflows. 
see > https://github.com/apache/spark/pull/19459#discussion_r146421804 > * Better type checking when converting Spark types to Arrow > * Timestamp conversion to microseconds (for Spark internal format) > * Full support for using validity mask with 'object' types > https://github.com/apache/spark/pull/18664#discussion_r146567335 > * VectorSchemaRoot can call close more than once to simplify listener > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22599) Avoid extra reading for cached table
[ https://issues.apache.org/jira/browse/SPARK-22599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277754#comment-16277754 ] Rajesh Balamohan commented on SPARK-22599: -- [~CodingCat] - Thanks for sharing the results. The results mention "SPARK-22599, master branch, parquet". Does that mean "SPARK-22599, master branch" was run with text data? > Avoid extra reading for cached table > > > Key: SPARK-22599 > URL: https://issues.apache.org/jira/browse/SPARK-22599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Nan Zhu > > In the current implementation of Spark, InMemoryTableExec reads all data in a > cached table, filters CachedBatches according to stats, and passes data to the > downstream operators. This implementation makes it inefficient to keep the > whole table in memory to serve various queries against different partitions > of the table, which covers a certain portion of our users' scenarios. > The following is an example of such a use case: > store_sales is a 1TB-sized table in cloud storage, which is partitioned by > 'location'. The first query, Q1, wants to output several metrics A, B, C for > all stores in all locations. After that, a small team of 3 data scientists > wants to do some causal analysis of the sales in different locations. To > avoid unnecessary I/O and parquet/orc parsing overhead, they want to cache > the whole table in memory in Q1. > With the current implementation, even if any one of the data scientists is only > interested in one out of three locations, the queries they submit to the Spark > cluster still read the 1TB of data completely. 
> The reason behind the extra reading operation is that we implement > CachedBatch as > {code} > case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: > InternalRow) > {code} > where "stats" is a part of every CachedBatch, so we can only filter batches > for output of InMemoryTableExec operator by reading all data in in-memory > table as input. The extra reading would be even more unacceptable when some > of the table's data is evicted to disks. > We propose to introduce a new type of block, metadata block, for the > partitions of RDD representing data in the cached table. Every metadata block > contains stats info for all columns in a partition and is saved to > BlockManager when executing compute() method for the partition. To minimize > the number of bytes to read, > More details can be found in design > doc:https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing > performance test results: > Environment: 6 Executors, each of which has 16 cores 90G memory > dataset: 1T TPCDS data > queries: tested 4 queries (Q19, Q46, Q34, Q27) in > https://github.com/databricks/spark-sql-perf/blob/c2224f37e50628c5c8691be69414ec7f5a3d919a/src/main/scala/com/databricks/spark/sql/perf/tpcds/ImpalaKitQueries.scala > results: > https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
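The metadata-block idea above can be sketched in miniature: keep per-batch column stats separate from the data, and consult only the stats to decide which batches to read. All names here are hypothetical simplifications of the proposal, not Spark's implementation:

```python
from dataclasses import dataclass

@dataclass
class CachedBatchMeta:
    """Per-batch column stats, the part the proposal keeps in a small
    metadata block so pruning never touches the data buffers."""
    min_vals: dict
    max_vals: dict

@dataclass
class CachedBatch:
    meta: CachedBatchMeta
    rows: list  # stand-in for the serialized column buffers

def might_match(meta, column, value):
    # An equality predicate can only match if value lies inside the
    # batch's [min, max] range for that column.
    return meta.min_vals[column] <= value <= meta.max_vals[column]

def scan(batches, column, value):
    reads, out = 0, []
    for b in batches:
        if not might_match(b.meta, column, value):
            continue  # pruned: the (possibly disk-resident) data is never read
        reads += 1
        out.extend(r for r in b.rows if r[column] == value)
    return out, reads

batches = [
    CachedBatch(CachedBatchMeta({"loc": 1}, {"loc": 1}), [{"loc": 1, "sales": 10}]),
    CachedBatch(CachedBatchMeta({"loc": 2}, {"loc": 3}), [{"loc": 2, "sales": 20}]),
]
rows, reads = scan(batches, "loc", 1)
print(rows, reads)  # only the first batch is actually read
```

The point of the design is the `reads` counter: with stats in a separate metadata block, batches for the other locations are skipped without deserializing or paging in their data.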
[jira] [Commented] (SPARK-20368) Support Sentry on PySpark workers
[ https://issues.apache.org/jira/browse/SPARK-20368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277710#comment-16277710 ] Taylor Edmiston commented on SPARK-20368: - I also posted this on the PR linked in the comment above, but I'd like to inquire about the status of this PR. Is it something that could be merged? Exception aggregation with Sentry in Python is such a common feature, and it's something I really need as well. I'd be happy to jump in and help push this over the finish line if possible. > Support Sentry on PySpark workers > - > > Key: SPARK-20368 > URL: https://issues.apache.org/jira/browse/SPARK-20368 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Alexander Shorin > > [Sentry|https://sentry.io] is a system, well known among Python developers, for > capturing, classifying, tracking and explaining tracebacks, helping people better > understand what went wrong, how to reproduce the issue, and how to fix it. > Any Spark application in Python is actually divided into two parts: > 1. The part that runs on the "driver side". The user controls this part fully, and > providing reports to Sentry from it is easy. > 2. The part that runs on executors. That's Python UDFs and the rest of the > transformation functions. Unfortunately, we cannot provide such a > feature here, and that is the part this feature is about. > To simplify the development experience, it would be nice to have optional > Sentry support at the PySpark worker level. > What could this feature look like? > 1. PySpark will have a new extra named {{sentry}} which installs the Sentry client > and any other required dependencies. This is an optional > install-time dependency. > 2. The PySpark worker will be able to detect the presence of Sentry support and send > error reports there. > 3. All configuration of Sentry could and will be done via standard Sentry > environment variables. > What will this feature give users? > 1. 
Better exceptions in Sentry. From the driver-side application, all of them > currently get recorded as a `Py4JJavaError` where the real executor exception is > buried in the traceback body. > 2. A much clearer picture of the context when things went wrong, and > why. > 3. Simpler debugging and reproduction of Python UDF issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
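The executor-side reporting described in the ticket can be sketched as a generic wrapper around a UDF. This is a hypothetical illustration: `with_error_reporting` and the report payload are made up, and a real integration would hand the exception to the Sentry client rather than a plain callable.

```python
import functools
import traceback

def with_error_reporting(reporter):
    """Wrap a UDF so any exception is handed to a reporter (a Sentry
    client, in the feature proposed here) before being re-raised for
    Spark's normal error handling."""
    def decorate(udf):
        @functools.wraps(udf)
        def wrapped(*args, **kwargs):
            try:
                return udf(*args, **kwargs)
            except Exception as exc:
                reporter({
                    "error": repr(exc),
                    "traceback": traceback.format_exc(),
                    "udf": udf.__name__,
                })
                raise  # keep Spark's failure semantics unchanged
        return wrapped
    return decorate

reports = []  # stand-in for a Sentry client's capture queue

@with_error_reporting(reports.append)
def divide(a, b):
    return a / b

try:
    divide(1, 0)
except ZeroDivisionError:
    pass
print(reports[0]["udf"], reports[0]["error"])
```

The worker-level hook the ticket asks for would apply this wrapping automatically when the optional Sentry dependency is detected, so users get the real executor exception instead of only the wrapped `Py4JJavaError` on the driver.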
[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277615#comment-16277615 ] Li Jin commented on SPARK-21187: Gotcha. Thanks! > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. > Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, Array, Map > * *Decimal* > Some things to do before closing this out: > * Look into upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal) > * Need to add some user docs > * Make sure Python tests are thorough > * Check into complex type support mentioned in comments by [~leif], should we > support multi-indexing? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277577#comment-16277577 ] Hyukjin Kwon commented on SPARK-22674: -- Basically yes, for now. I think we should avoid having a PySpark-only change anymore, to reduce overhead in general, for example, maintaining and reviewing costs. Performance measurement should also be a good step before we decide to go ahead. > PySpark breaks serialization of namedtuple subclasses > - > > Key: SPARK-22674 > URL: https://issues.apache.org/jira/browse/SPARK-22674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Jonas Amrich > > PySpark monkey-patches the namedtuple class to make it serializable; however, > this breaks serialization of its subclasses. With the current implementation, any > subclass will be serialized (and deserialized) as its parent namedtuple. > Consider this code, which will fail with {{AttributeError: 'Point' object has > no attribute 'sum'}}: > {code} > from collections import namedtuple > Point = namedtuple("Point", "x y") > class PointSubclass(Point): > def sum(self): > return self.x + self.y > rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]]) > rdd.collect()[0][0].sum() > {code} > Moreover, as PySpark hijacks all namedtuples in the main module, importing > pyspark breaks serialization of namedtuple subclasses even in code which is > not related to spark / distributed execution. I don't see any clean solution > to this; a possible workaround may be to limit the serialization hack only to > direct namedtuple subclasses like in > https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
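One possible user-side workaround, distinct from the linked commit's approach of limiting the hack to direct subclasses, is to give the subclass an explicit {{__reduce__}} so a serializer that special-cases namedtuples still rebuilds the subclass. A sketch (plain pickle shown; that this also survives PySpark's monkey patch is an assumption, since the subclass-level {{__reduce__}} should shadow the inherited patched one):

```python
import pickle
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

    def __reduce__(self):
        # Spell out the reconstruction recipe: rebuild PointSubclass
        # from its field values, instead of letting a namedtuple-aware
        # serializer collapse the instance to the parent Point.
        return (PointSubclass, tuple(self))

# Round-trip through pickle keeps the subclass and its methods.
p = pickle.loads(pickle.dumps(PointSubclass(1, 1)))
print(type(p).__name__, p.sum())  # PointSubclass 2
```

Without the explicit recipe, a serializer that rewrites namedtuple classes would deserialize the value as a plain `Point`, which is exactly the `AttributeError` shown in the report.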
[jira] [Assigned] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions
[ https://issues.apache.org/jira/browse/SPARK-22684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22684: Assignee: Apache Spark > Avoid the generation of useless mutable states by datetime functions > > > Key: SPARK-22684 > URL: https://issues.apache.org/jira/browse/SPARK-22684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Apache Spark > > Some datetime functions are defining mutable states which are not needed at > all. This is bad for the well known issues related to constant pool limits. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions
[ https://issues.apache.org/jira/browse/SPARK-22684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22684: Assignee: (was: Apache Spark) > Avoid the generation of useless mutable states by datetime functions > > > Key: SPARK-22684 > URL: https://issues.apache.org/jira/browse/SPARK-22684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido > > Some datetime functions are defining mutable states which are not needed at > all. This is bad for the well known issues related to constant pool limits. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions
[ https://issues.apache.org/jira/browse/SPARK-22684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277559#comment-16277559 ] Apache Spark commented on SPARK-22684: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19883 > Avoid the generation of useless mutable states by datetime functions > > > Key: SPARK-22684 > URL: https://issues.apache.org/jira/browse/SPARK-22684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido > > Some datetime functions are defining mutable states which are not needed at > all. This is bad for the well known issues related to constant pool limits. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22672) Refactor ORC Tests
[ https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277553#comment-16277553 ] Apache Spark commented on SPARK-22672: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19882 > Refactor ORC Tests > -- > > Key: SPARK-22672 > URL: https://issues.apache.org/jira/browse/SPARK-22672 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > Since SPARK-20682, we have two `OrcFileFormat`s. This issue refactors the ORC > tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277549#comment-16277549 ] Bryan Cutler commented on SPARK-21187: -- Hi [~icexelloss], StructType has been added on the Java side, but it still needs some work before it can be used in PySpark. It needs some of the same functions used for ArrayType, which I can submit a PR for soon, but we will need to upgrade Arrow to 0.8 before it can be merged. > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. > Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, Array, Map > * *Decimal* > Some things to do before closing this out: > * Look into upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal) > * Need to add some user docs > * Make sure Python tests are thorough > * Check into complex type support mentioned in comments by [~leif]; should we > support multi-indexing? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions
Marco Gaido created SPARK-22684: --- Summary: Avoid the generation of useless mutable states by datetime functions Key: SPARK-22684 URL: https://issues.apache.org/jira/browse/SPARK-22684 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Marco Gaido Some datetime functions define mutable states that are not needed at all. This is bad because of the well-known issues related to constant pool limits. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22672) Refactor ORC Tests
[ https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22672: -- Description: Since SPARK-20682, we have two `OrcFileFormat`s. This issue refactors the ORC tests. (was: To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` instead of `sql/hive`.) > Refactor ORC Tests > -- > > Key: SPARK-22672 > URL: https://issues.apache.org/jira/browse/SPARK-22672 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > Since SPARK-20682, we have two `OrcFileFormat`s. This issue refactors the ORC > tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22672) Refactor ORC Tests
[ https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22672: -- Summary: Refactor ORC Tests (was: Move OrcTest to `sql/core`) > Refactor ORC Tests > -- > > Key: SPARK-22672 > URL: https://issues.apache.org/jira/browse/SPARK-22672 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Trivial > > To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` > instead of `sql/hive`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22672) Refactor ORC Tests
[ https://issues.apache.org/jira/browse/SPARK-22672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22672: -- Priority: Major (was: Trivial) > Refactor ORC Tests > -- > > Key: SPARK-22672 > URL: https://issues.apache.org/jira/browse/SPARK-22672 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > To support ORC tests without Hive, we had better have `OrcTest` in `sql/core` > instead of `sql/hive`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses
[ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277424#comment-16277424 ] Jonas Amrich commented on SPARK-22674: -- Sure, you're right that pickle won't unpickle it without the class definition. As far as I know, PySpark uses the pickle serializer by default, and the hijack is there to enable namedtuple pickling and unpickling with regular pickle. Do you propose removing the hijack? Removing it would mean that regular pickle won't be able to unpickle namedtuples anymore, and therefore cloudpickle would have to be used as the default, which is quite a big change (and IMHO not very good for performance). > PySpark breaks serialization of namedtuple subclasses > - > > Key: SPARK-22674 > URL: https://issues.apache.org/jira/browse/SPARK-22674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Jonas Amrich > > PySpark monkey-patches the namedtuple class to make it serializable; however, > this breaks serialization of its subclasses. With the current implementation, any > subclass will be serialized (and deserialized) as its parent namedtuple. > Consider this code, which will fail with {{AttributeError: 'Point' object has > no attribute 'sum'}}: > {code} > from collections import namedtuple > Point = namedtuple("Point", "x y") > class PointSubclass(Point): > def sum(self): > return self.x + self.y > rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]]) > rdd.collect()[0][0].sum() > {code} > Moreover, as PySpark hijacks all namedtuples in the main module, importing > pyspark breaks serialization of namedtuple subclasses even in code which is > not related to Spark / distributed execution. I don't see any clean solution > to this; a possible workaround may be to limit the serialization hack only to > direct namedtuple subclasses, as in > https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
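The failure mode discussed in this thread can be reproduced without Spark at all. The sketch below imitates the monkey-patch with hypothetical helpers (`_restore` and `_hijack` are illustrative names, not PySpark's actual internals) and shows why a subclass round-trips as a plain namedtuple:

```python
import collections
import pickle

def _restore(name, fields, values):
    # Rebuild a plain namedtuple instance from its (name, fields, values) triple.
    return collections.namedtuple(name, fields)(*values)

def _hijack(cls):
    # Make instances pickle as (name, fields, values), roughly what the
    # PySpark patch does to every namedtuple class it finds.
    def __reduce__(self):
        return (_restore, (self.__class__.__name__, self._fields, tuple(self)))
    cls.__reduce__ = __reduce__
    return cls

Point = _hijack(collections.namedtuple("Point", "x y"))

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

p = pickle.loads(pickle.dumps(PointSubclass(1, 2)))
print(hasattr(p, "sum"))  # False: the subclass came back as a plain namedtuple
print(p.x + p.y)          # 3: the data itself survives
```

Because `__reduce__` records only the type name, the fields, and the values, the object rebuilt on the other side is a freshly created namedtuple class with none of `PointSubclass`'s methods, which is exactly the `AttributeError` from the issue description.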
[jira] [Resolved] (SPARK-22372) Make YARN client extend SparkApplication
[ https://issues.apache.org/jira/browse/SPARK-22372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22372. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19631 [https://github.com/apache/spark/pull/19631] > Make YARN client extend SparkApplication > > > Key: SPARK-22372 > URL: https://issues.apache.org/jira/browse/SPARK-22372 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > Fix For: 2.3.0 > > > For SPARK-11035 to work well, at least in cluster mode, YARN needs to > implement {{SparkApplication}} so that it doesn't use system properties to > propagate Spark configuration from spark-submit. > There is a second complication: YARN uses system properties to propagate > {{SPARK_YARN_MODE}} on top of other Spark configs. We should look at > either changing that to a configuration or removing {{SPARK_YARN_MODE}} > altogether if possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277271#comment-16277271 ] Li Jin commented on SPARK-21187: [~bryanc] Thanks for the update! Is there anything in particular that needs to be done for StructType? It seems it has already been handled: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java#L318 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L63 > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. > Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, Array, Map > * *Decimal* > Some things to do before closing this out: > * Look into upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal) > * Need to add some user docs > * Make sure Python tests are thorough > * Check into complex type support mentioned in comments by [~leif]; should we > support multi-indexing? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Cuquemelle updated SPARK-22683: -- Labels: pull-request-available (was: ) Description: let's say an executor has spark.executor.cores / spark.task.cpus taskSlots The current dynamic allocation policy allocates enough executors to have each taskSlot execute a single task, which minimizes latency, but wastes resources when tasks are small regarding executor allocation overhead. By adding the tasksPerExecutorSlot, it is made possible to specify how many tasks a single slot should ideally execute to mitigate the overhead of executor allocation. PR: https://github.com/apache/spark/pull/19881 was: let's say an executor has spark.executor.cores / spark.task.cpus taskSlots The current dynamic allocation policy allocates enough executors to have each taskSlot execute a single task, which minimizes latency, but wastes resources when tasks are small regarding executor allocation overhead. By adding the tasksPerExecutorSlot, it is made possible to specify how many tasks a single slot should ideally execute to mitigate the overhead of executor allocation. > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. 
> PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
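The effect of the proposed setting on the number of requested executors can be sketched with a little arithmetic. This is an illustration of the idea, not Spark's actual `ExecutorAllocationManager` code; `tasks_per_slot` stands in for the proposed `tasksPerExecutorSlot` knob:

```python
import math

def target_executors(pending_tasks, executor_cores, task_cpus, tasks_per_slot=1):
    """Number of executors to request for a given task backlog.

    tasks_per_slot=1 reproduces the current policy (one executor slot per
    pending task); larger values trade some latency for fewer, more fully
    utilized executors.
    """
    slots_per_executor = executor_cores // task_cpus
    return math.ceil(pending_tasks / (slots_per_executor * tasks_per_slot))

# 4000 pending tasks on 4-core executors with 1 CPU per task:
print(target_executors(4000, 4, 1))                     # 1000 under the current policy
print(target_executors(4000, 4, 1, tasks_per_slot=10))  # 100 with the proposed knob
```

With many short tasks, each slot running ten tasks back to back amortizes the executor allocation overhead over ten times as much work, which is the trade-off the ticket describes.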
[jira] [Commented] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277187#comment-16277187 ] Apache Spark commented on SPARK-22683: -- User 'jcuquemelle' has created a pull request for this issue: https://github.com/apache/spark/pull/19881 > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22683: Assignee: (was: Apache Spark) > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22683: Assignee: Apache Spark > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Assignee: Apache Spark > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Cuquemelle updated SPARK-22683: -- Priority: Major (was: Minor) > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22683: -- Target Version/s: (was: 2.1.1, 2.2.0) The overhead of small tasks doesn't change if you over-commit tasks with respect to task slots. I think this isn't really a solution, and the app needs to look at ways to make fewer, larger tasks. There's overhead to adding yet another knob to turn here, and its interaction with other settings isn't obvious. This concept isn't present elsewhere in Spark. You will also kind of get this effect anyway; if tasks are finishing very quickly, and locality wait is at all positive, you'll find tasks tend to favor older executors with cached data, and the newer ones, dynamically allocated, may get few or no tasks and deallocate anyway. Allocation only happens when the task backlog builds up. > Allow tuning the number of dynamically allocated executors wrt task number > -- > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Minor > > let's say an executor has spark.executor.cores / spark.task.cpus taskSlots > The current dynamic allocation policy allocates enough executors > to have each taskSlot execute a single task, which minimizes latency, > but wastes resources when tasks are small regarding executor allocation > overhead. > By adding the tasksPerExecutorSlot, it is made possible to specify how many > tasks > a single slot should ideally execute to mitigate the overhead of executor > allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22683) Allow tuning the number of dynamically allocated executors wrt task number
Julien Cuquemelle created SPARK-22683: - Summary: Allow tuning the number of dynamically allocated executors wrt task number Key: SPARK-22683 URL: https://issues.apache.org/jira/browse/SPARK-22683 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0, 2.1.0 Reporter: Julien Cuquemelle Priority: Minor let's say an executor has spark.executor.cores / spark.task.cpus taskSlots The current dynamic allocation policy allocates enough executors to have each taskSlot execute a single task, which minimizes latency, but wastes resources when tasks are small regarding executor allocation overhead. By adding the tasksPerExecutorSlot, it is made possible to specify how many tasks a single slot should ideally execute to mitigate the overhead of executor allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22162) Executors and the driver use inconsistent Job IDs during the new RDD commit protocol
[ https://issues.apache.org/jira/browse/SPARK-22162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22162. Resolution: Fixed Assignee: Reza Safi Fix Version/s: 2.3.0 > Executors and the driver use inconsistent Job IDs during the new RDD commit > protocol > > > Key: SPARK-22162 > URL: https://issues.apache.org/jira/browse/SPARK-22162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi >Assignee: Reza Safi > Fix For: 2.3.0 > > > After the SPARK-18191 commit in pull request 15769, using the new commit protocol > it is possible that the driver and executors use different jobIds during an RDD > commit. > In the old code, the variable stageId is part of the closure used to define > the task, as you can see here: > > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1098] > As a result, a TaskAttemptId is constructed on executors using the same > "stageId" as the driver, since it is a value that is serialized in the > driver. Also, the value of stageID is actually the rdd.id, which is assigned > here: > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1084] > However, after the change in pull request 15769, the value is no longer part > of the task closure, which gets serialized by the driver. Instead, it is > pulled from the taskContext, as you can see > here:[https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R103] > and then that value is used to construct the TaskAttemptId on the executors: > [https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R134] > taskContext has a stageID value, which is set in the DAGScheduler. So after > the change, unlike the old code where rdd.id was used, an actual stage.id is > used, which can differ between executors and the driver since it is no > longer serialized. > In summary, the old code consistently used rddId and just incorrectly named > it "stageId". > The new code uses a mix of rddId and stageId. There should be a consistent ID > between the executors and the driver. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
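The distinction above, an ID captured in the serialized task versus one read from ambient per-process state, can be shown with a toy sketch (plain Python, no Spark; all names here are invented for illustration):

```python
import pickle

# Ambient, per-process state: each process initializes this independently,
# so the driver and an executor may hold different values (the stage.id case).
local_context = {"id": None}

class Task:
    """A task that captures the commit ID at creation time (the rdd.id case)."""
    def __init__(self, commit_id):
        # An instance attribute is serialized with the task, so every
        # executor that deserializes it sees the driver's value.
        self.commit_id = commit_id

    def id_from_closure(self):
        return self.commit_id          # consistent everywhere

    def id_from_context(self):
        return local_context["id"]     # whatever this process happens to hold

# "Driver" side
local_context["id"] = 7                # driver's scheduler assigned 7
shipped = pickle.dumps(Task(commit_id=7))  # what gets sent to executors

# "Executor" side: its own context was initialized differently
local_context["id"] = 3
received = pickle.loads(shipped)
print(received.id_from_closure())      # 7, matches the driver
print(received.id_from_context())      # 3, diverges, like the reported bug
```

The closure-captured value behaves like the old rdd.id (serialized by the driver, identical everywhere), while the context lookup behaves like the post-change stage.id (whatever the local process happens to hold).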
[jira] [Updated] (SPARK-22162) Executors and the driver use inconsistent Job IDs during the new RDD commit protocol
[ https://issues.apache.org/jira/browse/SPARK-22162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-22162: --- Affects Version/s: (was: 2.3.0) > Executors and the driver use inconsistent Job IDs during the new RDD commit > protocol > > > Key: SPARK-22162 > URL: https://issues.apache.org/jira/browse/SPARK-22162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Reza Safi >Assignee: Reza Safi > Fix For: 2.3.0 > > > After the SPARK-18191 commit in pull request 15769, using the new commit protocol > it is possible that the driver and executors use different jobIds during an RDD > commit. > In the old code, the variable stageId is part of the closure used to define > the task, as you can see here: > > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1098] > As a result, a TaskAttemptId is constructed on executors using the same > "stageId" as the driver, since it is a value that is serialized in the > driver. Also, the value of stageID is actually the rdd.id, which is assigned > here: > [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1084] > However, after the change in pull request 15769, the value is no longer part > of the task closure, which gets serialized by the driver. Instead, it is > pulled from the taskContext, as you can see > here:[https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R103] > and then that value is used to construct the TaskAttemptId on the executors: > [https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R134] > taskContext has a stageID value, which is set in the DAGScheduler. So after > the change, unlike the old code where rdd.id was used, an actual stage.id is > used, which can differ between executors and the driver since it is no > longer serialized. > In summary, the old code consistently used rddId and just incorrectly named > it "stageId". > The new code uses a mix of rddId and stageId. There should be a consistent ID > between the executors and the driver. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22626) Wrong Hive table statistics may trigger OOM if enables CBO
[ https://issues.apache.org/jira/browse/SPARK-22626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276988#comment-16276988 ] Apache Spark commented on SPARK-22626: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/19880 > Wrong Hive table statistics may trigger OOM if enables CBO > -- > > Key: SPARK-22626 > URL: https://issues.apache.org/jira/browse/SPARK-22626 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 2.3.0 > > > How to reproduce: > {code} > bin/spark-shell --conf spark.sql.cbo.enabled=true > {code} > {code:java} > import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec > spark.sql("CREATE TABLE small (c1 bigint) TBLPROPERTIES ('numRows'='3', > 'rawDataSize'='600','totalSize'='800')") > // Big table with wrong statistics, numRows=0 > spark.sql("CREATE TABLE big (c1 bigint) TBLPROPERTIES ('numRows'='0', > 'rawDataSize'='600', 'totalSize'='8')") > val plan = spark.sql("select * from small t1 join big t2 on (t1.c1 = > t2.c1)").queryExecution.executedPlan > val buildSide = > plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide > println(buildSide) > {code} > The result is {{BuildRight}}, but the right side is the big table. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20706) Spark-shell not overriding method/variable definition
[ https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20706: Assignee: (was: Apache Spark) > Spark-shell not overriding method/variable definition > - > > Key: SPARK-20706 > URL: https://issues.apache.org/jira/browse/SPARK-20706 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0, 2.1.1, 2.2.0 > Environment: Linux, Scala 2.11.8 >Reporter: Raphael Roth > Attachments: screenshot-1.png > > > !screenshot-1.png!In the following example, the definition of myMethod is not > correctly updated: > -- > def myMethod() = "first definition" > val tmp = myMethod(); val out = tmp > println(out) // prints "first definition" > def myMethod() = "second definition" // override above myMethod > val tmp = myMethod(); val out = tmp > println(out) // should be "second definition" but is "first definition" > -- > I'm using semicolon to force two statements to be compiled at the same time. > It's also possible to reproduce the behavior using :paste > So if I-redefine myMethod, the implementation seems not to be updated in this > case. I figured out that the second-last statement (val out = tmp) causes > this behavior, if this is moved in a separate block, the code works just fine. > EDIT: > The same behavior can be seen when declaring variables : > -- > val a = 1 > val b = a; val c = b; > println(b) // prints "1" > val a = 2 // override a > val b = a; val c = b; > println(b) // prints "1" instead of "2" > -- > Interestingly, if the second-last line "val b = a; val c = b;" is executed > twice, then I get the expected result -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20706) Spark-shell not overriding method/variable definition
[ https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276948#comment-16276948 ] Apache Spark commented on SPARK-20706: -- User 'mpetruska' has created a pull request for this issue: https://github.com/apache/spark/pull/19879 > Spark-shell not overriding method/variable definition > - > > Key: SPARK-20706 > URL: https://issues.apache.org/jira/browse/SPARK-20706 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0, 2.1.1, 2.2.0 > Environment: Linux, Scala 2.11.8 >Reporter: Raphael Roth > Attachments: screenshot-1.png > > > !screenshot-1.png!In the following example, the definition of myMethod is not > correctly updated: > -- > def myMethod() = "first definition" > val tmp = myMethod(); val out = tmp > println(out) // prints "first definition" > def myMethod() = "second definition" // override above myMethod > val tmp = myMethod(); val out = tmp > println(out) // should be "second definition" but is "first definition" > -- > I'm using semicolon to force two statements to be compiled at the same time. > It's also possible to reproduce the behavior using :paste > So if I-redefine myMethod, the implementation seems not to be updated in this > case. I figured out that the second-last statement (val out = tmp) causes > this behavior, if this is moved in a separate block, the code works just fine. > EDIT: > The same behavior can be seen when declaring variables : > -- > val a = 1 > val b = a; val c = b; > println(b) // prints "1" > val a = 2 // override a > val b = a; val c = b; > println(b) // prints "1" instead of "2" > -- > Interestingly, if the second-last line "val b = a; val c = b;" is executed > twice, then I get the expected result -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20706) Spark-shell not overriding method/variable definition
[ https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20706: Assignee: Apache Spark > Spark-shell not overriding method/variable definition > - > > Key: SPARK-20706 > URL: https://issues.apache.org/jira/browse/SPARK-20706 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0, 2.1.1, 2.2.0 > Environment: Linux, Scala 2.11.8 >Reporter: Raphael Roth >Assignee: Apache Spark > Attachments: screenshot-1.png > > > !screenshot-1.png!In the following example, the definition of myMethod is not > correctly updated: > -- > def myMethod() = "first definition" > val tmp = myMethod(); val out = tmp > println(out) // prints "first definition" > def myMethod() = "second definition" // override above myMethod > val tmp = myMethod(); val out = tmp > println(out) // should be "second definition" but is "first definition" > -- > I'm using semicolon to force two statements to be compiled at the same time. > It's also possible to reproduce the behavior using :paste > So if I-redefine myMethod, the implementation seems not to be updated in this > case. I figured out that the second-last statement (val out = tmp) causes > this behavior, if this is moved in a separate block, the code works just fine. > EDIT: > The same behavior can be seen when declaring variables : > -- > val a = 1 > val b = a; val c = b; > println(b) // prints "1" > val a = 2 // override a > val b = a; val c = b; > println(b) // prints "1" instead of "2" > -- > Interestingly, if the second-last line "val b = a; val c = b;" is executed > twice, then I get the expected result -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22682) HashExpression does not need to create global variables
[ https://issues.apache.org/jira/browse/SPARK-22682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22682: Assignee: Wenchen Fan (was: Apache Spark) > HashExpression does not need to create global variables > --- > > Key: SPARK-22682 > URL: https://issues.apache.org/jira/browse/SPARK-22682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22682) HashExpression does not need to create global variables
[ https://issues.apache.org/jira/browse/SPARK-22682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276945#comment-16276945 ] Apache Spark commented on SPARK-22682: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19878 > HashExpression does not need to create global variables > --- > > Key: SPARK-22682 > URL: https://issues.apache.org/jira/browse/SPARK-22682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22682) HashExpression does not need to create global variables
[ https://issues.apache.org/jira/browse/SPARK-22682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22682: Assignee: Apache Spark (was: Wenchen Fan) > HashExpression does not need to create global variables > --- > > Key: SPARK-22682 > URL: https://issues.apache.org/jira/browse/SPARK-22682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22682) HashExpression does not need to create global variables
Wenchen Fan created SPARK-22682: --- Summary: HashExpression does not need to create global variables Key: SPARK-22682 URL: https://issues.apache.org/jira/browse/SPARK-22682 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20706) Spark-shell not overriding method/variable definition
[ https://issues.apache.org/jira/browse/SPARK-20706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276921#comment-16276921 ] Mark Petruska commented on SPARK-20706: --- This is a Scala REPL bug, see: https://github.com/scala/bug/issues/9740. The fix for this made it into Scala 2.11.9. Basically it affects "class-based" Scala shells, which is what Spark-shell uses. Creating the PR for the fix. > Spark-shell not overriding method/variable definition > - > > Key: SPARK-20706 > URL: https://issues.apache.org/jira/browse/SPARK-20706 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0, 2.1.1, 2.2.0 > Environment: Linux, Scala 2.11.8 >Reporter: Raphael Roth > Attachments: screenshot-1.png > > > !screenshot-1.png! In the following example, the definition of myMethod is not > correctly updated: > -- > def myMethod() = "first definition" > val tmp = myMethod(); val out = tmp > println(out) // prints "first definition" > def myMethod() = "second definition" // override above myMethod > val tmp = myMethod(); val out = tmp > println(out) // should be "second definition" but is "first definition" > -- > I'm using semicolons to force two statements to be compiled at the same time. > It's also possible to reproduce the behavior using :paste > So if I re-define myMethod, the implementation seems not to be updated in this > case. I figured out that the second-to-last statement (val out = tmp) causes > this behavior; if this is moved into a separate block, the code works just fine. 
> EDIT: > The same behavior can be seen when declaring variables: > -- > val a = 1 > val b = a; val c = b; > println(b) // prints "1" > val a = 2 // override a > val b = a; val c = b; > println(b) // prints "1" instead of "2" > -- > Interestingly, if the second-to-last line "val b = a; val c = b;" is executed > twice, then I get the expected result -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276892#comment-16276892 ] Sasaki Toru commented on SPARK-20050: - Thank you for the comment. I think this patch can be backported to branch-2.1 and will fix the same issue. > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as > below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {code} > Some records which were processed in the last batch before a graceful > shutdown are reprocessed in the first batch after Spark Streaming restarts, as > below > * output of the first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // the last record before > shutting down Spark Streaming gracefully > {code} > * output of the re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 10 offset: 101452481 > {code} > This may be because offsets specified in commitAsync are committed at the head of the next > batch. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276892#comment-16276892 ] Sasaki Toru edited comment on SPARK-20050 at 12/4/17 2:54 PM: -- Thank you for the comment. I think this patch can be backported to branch-2.1 and will fix the same issue in version 2.1. was (Author: sasakitoa): Thank you for the comment. I think this patch can be backported to branch-2.1 and will fix the same issue. > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru > > I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as > below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {code} > Some records which were processed in the last batch before a graceful > shutdown are reprocessed in the first batch after Spark Streaming restarts, as > below > * output of the first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // the last record before > shutting down Spark Streaming gracefully > {code} > * output of the re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 10 offset: 101452481 > {code} > This may be because offsets specified in commitAsync are committed at the head of the next > batch. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
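Until commit timing guarantees improve, a restarted consumer has to tolerate at-least-once delivery. The following is a minimal, Spark-independent sketch of the usual defensive pattern: drop replayed records whose offsets are at or below the last offset the application itself persisted. The class and method names are illustrative, not a Spark or Kafka API.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: deduplicate replayed records after a restart by tracking
// the highest offset already processed (persisted by the application).
public class OffsetDedupe {
    private long lastProcessed;

    public OffsetDedupe(long lastCommittedOffset) {
        this.lastProcessed = lastCommittedOffset;
    }

    // Returns only the offsets the restarted job has not already processed,
    // advancing the high-water mark as it goes.
    public List<Long> filterReplayed(List<Long> offsets) {
        List<Long> fresh = new ArrayList<>();
        for (long off : offsets) {
            if (off > lastProcessed) {
                fresh.add(off);
                lastProcessed = off;
            }
        }
        return fresh;
    }
}
```

With lastCommittedOffset = 101452480, the three duplicated records from the re-run output above would be dropped and only offset 101452481 kept.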
[jira] [Commented] (SPARK-1940) Enable rolling of executor logs (stdout / stderr)
[ https://issues.apache.org/jira/browse/SPARK-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276759#comment-16276759 ] Apache Spark commented on SPARK-1940: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/895 > Enable rolling of executor logs (stdout / stderr) > - > > Key: SPARK-1940 > URL: https://issues.apache.org/jira/browse/SPARK-1940 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.1.0 > > > Currently, in the default log4j configuration, all the executor logs get sent > to the file [executor-working-dir]/stderr. This does not allow log > files to be rolled, so old logs cannot be removed. > Using log4j RollingFileAppender allows log4j logs to be rolled, but all the > logs get sent to a different set of files, other than the files > stdout and stderr. So the logs are no longer visible in > the Spark web UI, as the Spark web UI only reads the files > stdout and stderr. Furthermore, it still does not > allow stdout and stderr to be cleared periodically in case a large amount > of output gets written to them (e.g. by explicit println inside a map function). > Solving this requires rolling the logs in such a way that the Spark web UI is > aware of it and can retrieve the logs across the rolled-over files. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
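For reference, the log4j 1.x RollingFileAppender approach the description mentions looks roughly like the sketch below. The file path, size limits, and pattern are illustrative only, and (as the issue notes) files rolled this way are not automatically picked up by the Spark web UI:

```properties
# Hedged sketch of a log4j 1.x rolling-file configuration; values are
# illustrative, not Spark defaults.
log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/executor.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```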
[jira] [Assigned] (SPARK-22681) Accumulator should only be updated once for each task in result stage
[ https://issues.apache.org/jira/browse/SPARK-22681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22681: Assignee: Apache Spark > Accumulator should only be updated once for each task in result stage > - > > Key: SPARK-22681 > URL: https://issues.apache.org/jira/browse/SPARK-22681 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Carson Wang >Assignee: Apache Spark > > As the doc says "For accumulator updates performed inside actions only, Spark > guarantees that each task’s update to the accumulator will only be applied > once, i.e. restarted tasks will not update the value." > But currently the code doesn't guarantee this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22681) Accumulator should only be updated once for each task in result stage
[ https://issues.apache.org/jira/browse/SPARK-22681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276718#comment-16276718 ] Apache Spark commented on SPARK-22681: -- User 'carsonwang' has created a pull request for this issue: https://github.com/apache/spark/pull/19877 > Accumulator should only be updated once for each task in result stage > - > > Key: SPARK-22681 > URL: https://issues.apache.org/jira/browse/SPARK-22681 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Carson Wang > > As the doc says "For accumulator updates performed inside actions only, Spark > guarantees that each task’s update to the accumulator will only be applied > once, i.e. restarted tasks will not update the value." > But currently the code doesn't guarantee this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22681) Accumulator should only be updated once for each task in result stage
[ https://issues.apache.org/jira/browse/SPARK-22681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22681: Assignee: (was: Apache Spark) > Accumulator should only be updated once for each task in result stage > - > > Key: SPARK-22681 > URL: https://issues.apache.org/jira/browse/SPARK-22681 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Carson Wang > > As the doc says "For accumulator updates performed inside actions only, Spark > guarantees that each task’s update to the accumulator will only be applied > once, i.e. restarted tasks will not update the value." > But currently the code doesn't guarantee this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22681) Accumulator should only be updated once for each task in result stage
Carson Wang created SPARK-22681: --- Summary: Accumulator should only be updated once for each task in result stage Key: SPARK-22681 URL: https://issues.apache.org/jira/browse/SPARK-22681 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Carson Wang As the doc says "For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value." But currently the code doesn't guarantee this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
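The guarantee under discussion, that each result-stage task's update is applied at most once, amounts to deduplicating updates by the task's partition. A hedged, Spark-independent sketch of that idea (OnceOnlyAccumulator is a made-up name, not Spark's AccumulatorV2 API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: apply each partition's accumulator update at most once, so a
// resubmitted or speculative task for the same partition is ignored.
public class OnceOnlyAccumulator {
    private long value;
    private final Map<Integer, Boolean> applied = new HashMap<>();

    // Returns true if the update was applied, false if this partition's
    // result was already counted.
    public boolean add(int partitionId, long update) {
        if (applied.putIfAbsent(partitionId, Boolean.TRUE) != null) {
            return false;
        }
        value += update;
        return true;
    }

    public long value() {
        return value;
    }
}
```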
[jira] [Updated] (SPARK-22680) SparkSQL scans all partitions when the specified partitions do not exist in a Parquet-formatted table
[ https://issues.apache.org/jira/browse/SPARK-22680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochen Ouyang updated SPARK-22680: Summary: SparkSQL scans all partitions when the specified partitions do not exist in a Parquet-formatted table (was: SparkSQL scans all partitions when the specified partition does not exist in a Parquet-formatted table) > SparkSQL scans all partitions when the specified partitions do not exist in > a Parquet-formatted table > > > Key: SPARK-22680 > URL: https://issues.apache.org/jira/browse/SPARK-22680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.2.0 > Environment: spark2.0.2 spark2.2.0 >Reporter: Xiaochen Ouyang > > 1. spark-sql --master local[2] > 2. create external table test (id int, name string) partitioned by (country > string, province string, day string, hour int) stored as parquet location > '/warehouse/test'; > 3. produce data into table test > 4. select count(1) from test where country = '185' and province = '021' and > day = '2017-11-12' and hour = 10; if the 4 filter conditions do not exist > in HDFS and the MetaStore [MySQL], this SQL will scan all partitions in table test -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22680) SparkSQL scans all partitions when the specified partition does not exist in a Parquet-formatted table
Xiaochen Ouyang created SPARK-22680: --- Summary: SparkSQL scans all partitions when the specified partition does not exist in a Parquet-formatted table Key: SPARK-22680 URL: https://issues.apache.org/jira/browse/SPARK-22680 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0, 2.0.2 Environment: spark2.0.2 spark2.2.0 Reporter: Xiaochen Ouyang 1. spark-sql --master local[2] 2. create external table test (id int, name string) partitioned by (country string, province string, day string, hour int) stored as parquet location '/warehouse/test'; 3. produce data into table test 4. select count(1) from test where country = '185' and province = '021' and day = '2017-11-12' and hour = 10; if the 4 filter conditions do not exist in HDFS and the MetaStore [MySQL], this SQL will scan all partitions in table test -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
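The expected behavior the report implies is that pruning a partition list with a predicate matching nothing yields an empty scan set, never a full scan. A minimal sketch of that contract (class and method names are illustrative, not Spark's catalog code):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hedged sketch of partition pruning semantics: a non-matching predicate
// must produce an empty list of partitions to scan, not "all partitions".
public class PartitionPruning {
    public static List<String> prune(List<String> partitions, Predicate<String> pred) {
        return partitions.stream().filter(pred).collect(Collectors.toList());
    }
}
```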
[jira] [Assigned] (SPARK-11239) PMML export for ML linear regression
[ https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11239: Assignee: Apache Spark > PMML export for ML linear regression > > > Key: SPARK-11239 > URL: https://issues.apache.org/jira/browse/SPARK-11239 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk >Assignee: Apache Spark > > Add PMML export for linear regression models from the ML pipeline. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11239) PMML export for ML linear regression
[ https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276643#comment-16276643 ] Apache Spark commented on SPARK-11239: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/19876 > PMML export for ML linear regression > > > Key: SPARK-11239 > URL: https://issues.apache.org/jira/browse/SPARK-11239 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk > > Add PMML export for linear regression models from the ML pipeline. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11239) PMML export for ML linear regression
[ https://issues.apache.org/jira/browse/SPARK-11239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11239: Assignee: (was: Apache Spark) > PMML export for ML linear regression > > > Key: SPARK-11239 > URL: https://issues.apache.org/jira/browse/SPARK-11239 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: holdenk > > Add PMML export for linear regression models from the ML pipeline. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11171) PMML for Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-11171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276640#comment-16276640 ] Apache Spark commented on SPARK-11171: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/19876 > PMML for Pipelines API > -- > > Key: SPARK-11171 > URL: https://issues.apache.org/jira/browse/SPARK-11171 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: holdenk > > We need to add PMML export to the spark.ml Pipelines API. > We should make 1 subtask JIRA per model. Hopefully we can reuse the > underlying implementation, adding simple wrappers. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
[ https://issues.apache.org/jira/browse/SPARK-22473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276629#comment-16276629 ] Apache Spark commented on SPARK-22473: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19875 > Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date > -- > > Key: SPARK-22473 > URL: https://issues.apache.org/jira/browse/SPARK-22473 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > Fix For: 2.3.0 > > > In `spark-sql` module tests there are deprecation warnings caused by the > usage of deprecated methods of `java.sql.Date` and the usage of the > deprecated `AsyncAssertions.Waiter` class. > This issue is to track their replacement with their respective non-deprecated > versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
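For the java.sql.Date part, the non-deprecated replacements are the java.time bridges added in JDK 8: Date.valueOf(LocalDate) replaces the deprecated year/month/day constructor, and Date#toLocalDate replaces the deprecated getters. A small sketch of the migration (the helper class and method names are illustrative):

```java
import java.sql.Date;
import java.time.LocalDate;

// Hedged sketch of migrating off deprecated java.sql.Date APIs.
public class DateMigration {
    // Replaces the deprecated `new Date(year - 1900, month - 1, day)`
    // constructor with the java.time bridge.
    public static Date fromYmd(int year, int month, int day) {
        return Date.valueOf(LocalDate.of(year, month, day));
    }

    // Replaces the deprecated `d.getYear() + 1900` accessor.
    public static int yearOf(Date d) {
        return d.toLocalDate().getYear();
    }
}
```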
[jira] [Commented] (SPARK-22365) Spark UI executors empty list with 500 error
[ https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276618#comment-16276618 ] bruce xu commented on SPARK-22365: -- Hi [~dubovsky]. Glad to have your response. I hit this issue using the Spark Thrift Server as a JDBC service; the Spark version is 2.2.1-rc1. I will also try to find the cause. It looks like a bug in any case. > Spark UI executors empty list with 500 error > > > Key: SPARK-22365 > URL: https://issues.apache.org/jira/browse/SPARK-22365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Jakub Dubovsky > Attachments: spark-executor-500error.png > > > No data loaded on the "executors" tab in the Spark UI, with the stack trace below. Apart > from the exception I have nothing more. But if I can test something to make this > easier to resolve I am happy to help. > {code} > java.lang.NullPointerException > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22660) Compile with scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276583#comment-16276583 ] Sean Owen commented on SPARK-22660: --- You keep changing what this JIRA is about. There are too many JDK 9 issues for one JIRA. Please change this to match the scope of the PR you opened. After that, identify another logical change or fix. In any case, as noted here already, Hadoop 2 won't work with Java 9. > Compile with scala-2.12 and JDK9 > > > Key: SPARK-22660 > URL: https://issues.apache.org/jira/browse/SPARK-22660 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: liyunzhang >Priority: Minor > > Build with Scala 2.12 with the following steps > 1. change the pom.xml to scala-2.12 > ./dev/change-scala-version.sh 2.12 > 2. build with -Pscala-2.12 > for hive on spark > {code} > ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn > -Pparquet-provided -Dhadoop.version=2.7.3 > {code} > for spark sql > {code} > ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Phive > -Dhadoop.version=2.7.3>log.sparksql 2>&1 > {code} > get the following errors > #Error1 > {code} > /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: > error: cannot find symbol > Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory)); > {code} > This is because sun.misc.Cleaner has been moved to a new location in JDK9. > HADOOP-12760 will be the long term fix > #Error2 > {code} > spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: > ambiguous reference to overloaded definition, method limit in class > ByteBuffer of type (x$1: Int)java.nio.ByteBuffer > method limit in class Buffer of type ()Int > match expected type ? > val resultSize = serializedDirectResult.limit > error > {code} > The limit method was moved from ByteBuffer to the superclass Buffer and it > can no longer be called without (). The same applies to the position method. 
> #Error3 > {code} > home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: > ambiguous reference to overloaded definition, [error] both method putAll in > class Properties of type (x$1: java.util.Map[_, _])Unit [error] and method > putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: > Object])Unit [error] match argument types (java.util.Map[String,String]) > [error] properties.putAll(propsMap.asJava) > [error]^ > [error] > /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: > ambiguous reference to overloaded definition, [error] both method putAll in > class Properties of type (x$1: java.util.Map[_, _])Unit [error] and method > putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: > Object])Unit [error] match argument types (java.util.Map[String,String]) > [error] props.putAll(outputSerdeProps.toMap.asJava) > [error] ^ > {code} > This is because the key type is Object instead of String, which is unsafe. > After solving these 3 errors, compilation succeeds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
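Errors 2 and 3 both stem from overload resolution that Scala 2.12 hits on JDK 9. The workarounds described above, calling limit() explicitly and copying properties entry by entry instead of putAll, can be sketched in plain Java (the class and method names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.Properties;

// Hedged sketch of the two workarounds described in the issue.
public class Jdk9Compat {
    // Scala's `serializedDirectResult.limit` is ambiguous between
    // Buffer.limit(): Int and ByteBuffer.limit(Int); writing the explicit
    // no-arg call resolves it.
    public static int sizeOf(ByteBuffer buf) {
        return buf.limit();
    }

    // Properties.putAll(Map[String, String]) is ambiguous on JDK 9; copying
    // entry by entry via setProperty sidesteps the overload problem and
    // keeps keys/values as Strings.
    public static Properties copyInto(Properties props, Map<String, String> m) {
        for (Map.Entry<String, String> e : m.entrySet()) {
            props.setProperty(e.getKey(), e.getValue());
        }
        return props;
    }
}
```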
[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency
[ https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276569#comment-16276569 ] Omer van Kloeten commented on SPARK-22634: -- Understandable, but since Bouncy Castle may be used by users of Spark transitively, they either evict (in which case there may be unforeseen consequences) or are using a very old version with known CVEs which may affect their code. I'd recommend including it in a maintenance release and having it prominently displayed in the release notes. > Update Bouncy castle dependency > --- > > Key: SPARK-22634 > URL: https://issues.apache.org/jira/browse/SPARK-22634 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Lior Regev >Assignee: Sean Owen >Priority: Minor > Fix For: 2.3.0 > > > Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka > streaming, uses Bouncy Castle version 1.51 > This is an outdated version, as the latest one is 1.58 > This, in turn, renders packages such as > [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds] > unusable, since these require 1.58 and Spark's distributions come along with > 1.51 > My own attempt was to run on EMR, and since I automatically get all of > Spark's dependencies (Bouncy Castle 1.51 being one of them) into the > classpath, using the library to parse blockchain data failed due to missing > functionality. > I have also opened an > [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency] > with jets3t to update their dependency as well, but along with that Spark > would have to update its own or at least be packaged with a newer version -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
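For users hitting this before a Spark release picks up the newer version, the usual workaround is to pin the Bouncy Castle version in their own build. A hedged Maven sketch; the 1.58 version number comes from this discussion, and bcprov-jdk15on is the standard provider artifact:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Force the newer Bouncy Castle over the 1.51 pulled in transitively -->
    <dependency>
      <groupId>org.bouncycastle</groupId>
      <artifactId>bcprov-jdk15on</artifactId>
      <version>1.58</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```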
[jira] [Commented] (SPARK-22365) Spark UI executors empty list with 500 error
[ https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276564#comment-16276564 ] Jakub Dubovsky commented on SPARK-22365: In my instance it looks like it is a result of some dependency version conflict. I submit my spark using [spark notebook|https://github.com/spark-notebook/spark-notebook]. Since that is a web application as well it conflicts with the Spark UI somehow. I will dig deeper once this is closer to the top of my backlog... [~xwc3504] Thanks for posting this here! What kind of setup do you have? Do you use spark notebook as well? > Spark UI executors empty list with 500 error > > > Key: SPARK-22365 > URL: https://issues.apache.org/jira/browse/SPARK-22365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Jakub Dubovsky > Attachments: spark-executor-500error.png > > > No data loaded on the "executors" tab in the Spark UI, with the stack trace below. Apart > from the exception I have nothing more. But if I can test something to make this > easier to resolve I am happy to help. 
> {code} > java.lang.NullPointerException > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > {code}
[jira] [Resolved] (SPARK-22670) Not able to create table in Hive with SparkSession when JavaSparkContext is already initialized.
[ https://issues.apache.org/jira/browse/SPARK-22670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22670. --- Resolution: Not A Problem That's an issue with the design of your app then. > Not able to create table in HIve with SparkSession when JavaSparkContext is > already initialized. > > > Key: SPARK-22670 > URL: https://issues.apache.org/jira/browse/SPARK-22670 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Naresh Meena > > Not able to create table in Hive with SparkSession when SparkContext is > already initialized. > Below is the code snippet and error logs. > JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf); > SparkSession hiveCtx = SparkSession > .builder() > > .config(HiveConf.ConfVars.METASTOREURIS.toString(), > "..:9083") > .config("spark.sql.warehouse.dir", > "/apps/hive/warehouse") > .enableHiveSupport().getOrCreate(); > 2017-11-29 13:11:33 Driver [ERROR] SparkBatchSubmitter - Failed to start the > driver for Batch_JDBC_PipelineTest > org.apache.spark.sql.AnalysisException: > Hive support is required to insert into the following tables: > `default`.`testhivedata` >;; > 'InsertIntoTable 'SimpleCatalogRelation default, CatalogTable( > Table: `default`.`testhivedata` > Created: Wed Nov 29 13:11:33 IST 2017 > Last Access: Thu Jan 01 05:29:59 IST 1970 > Type: MANAGED > Schema: [StructField(empID,LongType,true), > StructField(empDate,DateType,true), StructField(empName,StringType,true), > StructField(empSalary,DoubleType,true), > StructField(empLocation,StringType,true), > StructField(empConditions,BooleanType,true), > StructField(empCity,StringType,true), > StructField(empSystemIP,StringType,true)] > Provider: hive > Storage(Location: > file:/hadoop/yarn/local/usercache/sax/appcache/application_1511627000183_0190/container_e34_1511627000183_0190_01_01/spark-warehouse/testhivedata, > InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: > 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), > OverwriteOptions(false,Map()), false > +- LogicalRDD [empID#49L, empDate#50, empName#51, empSalary#52, > empLocation#53, empConditions#54, empCity#55, empSystemIP#56] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:405) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:76) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at > 
org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:263) > at > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:243) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
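The underlying problem in the snippet above is that a plain {{JavaSparkContext}} already exists when the builder runs, so {{getOrCreate()}} reuses that context and the Hive catalog setting never takes effect. A sketch of one way around it (Spark 2.1-era API; the metastore host is a placeholder) is to build the Hive-enabled session first and derive the Java context from it:

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

// Build the Hive-enabled SparkSession before any SparkContext exists, so
// enableHiveSupport() can actually configure the catalog implementation.
val spark = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://metastore-host:9083") // placeholder host
  .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Reuse the session's context instead of constructing a separate one up front.
val jsc = new JavaSparkContext(spark.sparkContext)
```

This is a sketch, not a supported pattern from the Spark docs for this exact scenario; the point is only the ordering: the Hive-enabled session must be created before any other SparkContext.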
[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency
[ https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276556#comment-16276556 ] Sean Owen commented on SPARK-22634: --- I'm hesitant to do that in a maintenance branch because it's a minor version change. I don't see info on CVEs relevant to Spark either. > Update Bouncy castle dependency > --- > > Key: SPARK-22634 > URL: https://issues.apache.org/jira/browse/SPARK-22634 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Lior Regev >Assignee: Sean Owen >Priority: Minor > Fix For: 2.3.0 > > > Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka streaming, uses Bouncy Castle version 1.51 > This is an outdated version; the latest is 1.58 > This, in turn, renders packages such as > [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds] > unusable, since these require 1.58 and Spark's distributions ship with 1.51 > My own attempt was to run on EMR, and since I automatically get all of Spark's dependencies (Bouncy Castle 1.51 being one of them) on the classpath, using the library to parse blockchain data failed due to missing functionality. > I have also opened an > [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency] > with jets3t to update their dependency as well, but along with that Spark would have to update its own or at least be packaged with a newer version
[jira] [Commented] (SPARK-7953) Spark should cleanup output dir if job fails
[ https://issues.apache.org/jira/browse/SPARK-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276535#comment-16276535 ] Nandor Kollar commented on SPARK-7953: -- [~joshrosen] could you please help me with this issue? Is it still an outstanding bug? It looks like Spark 2.2 already includes SPARK-18219, and it seems that the new commit protocol calls abortJob and abortTask. > Spark should cleanup output dir if job fails > > > Key: SPARK-7953 > URL: https://issues.apache.org/jira/browse/SPARK-7953 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Mohit Sabharwal > > MR calls abortTask and abortJob on the {{OutputCommitter}} to clean up the temporary output directories, but Spark doesn't seem to do that (when outputting an RDD to a Hadoop FS) > For example: {{PairRDDFunctions.saveAsNewAPIHadoopDataset}} should call {{committer.abortTask(hadoopContext)}} in the finally block inside the writeShard closure, and {{jobCommitter.abortJob(jobTaskContext, JobStatus.State.FAILED)}} should also be called if the job fails. > Additionally, MR removes the output dir if the job fails, but Spark doesn't.
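The cleanup the description asks for boils down to a commit-or-abort discipline around each write. A minimal sketch against the Hadoop {{OutputCommitter}} API (the {{writeShard}} name mirrors the closure mentioned in the description; this is illustrative, not Spark's actual internals):

```scala
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}

// Per-task discipline: commit on success, abort on failure so the committer
// can delete the task attempt's temporary output directory.
def writeShard(committer: OutputCommitter, ctx: TaskAttemptContext)(write: => Unit): Unit = {
  committer.setupTask(ctx)
  try {
    write
    committer.commitTask(ctx)
  } catch {
    case e: Throwable =>
      committer.abortTask(ctx) // clean up temp output for this attempt
      throw e
  }
}
```

On the driver side, the corresponding job-level call would be {{jobCommitter.abortJob(jobContext, JobStatus.State.FAILED)}} when the job ultimately fails, which is what removes the job-level temporary output directory.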
[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency
[ https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276492#comment-16276492 ] Omer van Kloeten commented on SPARK-22634: -- [~srowen], thanks for taking this up. However, this seems like more of a fix for 2.2.1 than for 2.3.0, since Bouncy Castle is a crypto library and 1.51 -> 1.58 contains fixes for numerous CVEs. > Update Bouncy castle dependency > --- > > Key: SPARK-22634 > URL: https://issues.apache.org/jira/browse/SPARK-22634 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Lior Regev >Assignee: Sean Owen >Priority: Minor > Fix For: 2.3.0 > > > Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka streaming, uses Bouncy Castle version 1.51 > This is an outdated version; the latest is 1.58 > This, in turn, renders packages such as > [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds] > unusable, since these require 1.58 and Spark's distributions ship with 1.51 > My own attempt was to run on EMR, and since I automatically get all of Spark's dependencies (Bouncy Castle 1.51 being one of them) on the classpath, using the library to parse blockchain data failed due to missing functionality. > I have also opened an > [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency] > with jets3t to update their dependency as well, but along with that Spark would have to update its own or at least be packaged with a newer version
[jira] [Updated] (SPARK-22286) OutOfMemoryError caused by memory leak and large serializer batch size in ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-22286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lijie Xu updated SPARK-22286: - Description: *[Abstract]* I recently encountered an OOM error in a simple _groupByKey_ application. After profiling the application, I found that the OOM error is related to the shuffle spill and record (de)serialization. After analyzing the OOM heap dump, I found the root causes are (1) a memory leak in ExternalAppendOnlyMap, (2) the large static serializer batch size (_spark.shuffle.spill.batchSize_ = 10,000) defined in ExternalAppendOnlyMap, and (3) a memory leak in the deserializer. Since almost all Spark applications rely on ExternalAppendOnlyMap to perform shuffle and reduce, this is a critical bug/defect. In the following sections, I will detail the testing application, data, environment, failure symptoms, diagnosing procedure, identified root causes, and potential solutions. *[Application]* This is a simple GroupBy application as follows. {code} table.map(row => (row.sourceIP[1,7], row)).groupByKey().saveAsTextFile() {code} The _sourceIP_ (an IP address like 127.100.101.102) is a column of the _UserVisits_ table. This application has the same logic as the aggregation query in the Berkeley SQL benchmark (https://amplab.cs.berkeley.edu/benchmark/) as follows. {code} SELECT * FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7); {code} The application code is available at \[1\]. *[Data]* The UserVisits table size is 16GB (9 columns, 132,000,000 rows) with uniform distribution. The HDFS block size is 128MB. The data generator is available at \[2\]. *[Environment]* Spark 2.1 (Spark 2.2 may also have this error), Oracle Java Hotspot 1.8.0, 1 master and 8 workers as follows. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Workers.png|width=100%! This application launched 32 executors. Each executor has 1 core and 7GB memory. 
The detailed application configuration is {code} total-executor-cores = 32 executor-cores = 1 executor-memory = 7G spark.default.parallelism=32 spark.serializer = JavaSerializer (KryoSerializer also has OOM error) {code} *[Failure symptoms]* This application has a map stage and a reduce stage. An OOM error occurs in a reduce task (Task-17) as follows. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Stage.png|width=100%! !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Tasks.png|width=100%! Task-17 generated an OOM error. It shuffled ~1GB data and spilled 3.6GB data onto the disk. The Task-17 log below shows that this task was reading the next record by invoking _ExternalAppendOnlyMap.hasNext_(). From the OOM stack traces and the above shuffle metrics, we cannot identify the OOM root causes. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/OOMStackTrace.png|width=100%! The question is why Task-17 still suffered OOM errors even after spilling large in-memory data onto the disk. *[Diagnosing procedure]* Since each executor has 1 core and 7GB, it runs only one task at a time, so the OOM implies that a single task's memory usage exceeded 7GB. *1: Identify the error phase* I added some debug logs in Spark and found that the error occurs not in the spill phase but in the memory-disk-merge phase. In the memory-disk-merge phase, Spark reads back the spilled records (as shown in ① in Figure 1), merges the spilled records with the in-memory records (as shown in ②), generates new records, and outputs the new records onto HDFS (as shown in ③). *2. Dataflow and memory usage analysis* I added some profiling code and obtained the dataflow and memory usage metrics below. Ki represents the _i_-th key; Ri represents the _i_-th row in the table. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/DataflowAndMemoryUsage.png|width=100%! 
Figure 1: Dataflow and Memory Usage Analysis (see https://github.com/JerryLead/Misc/blob/master/SparkPRFigures/OOM/SPARK-22286-OOM.pdf for the high-definition version) The concrete phases with metrics are as follows. *[Shuffle read]* records = 7,540,235, bytes = 903 MB *[In-memory store]* As shown in the following log, about 5,243,424 of the 7,540,235 records are aggregated into 60 records in AppendOnlyMap. Each record is about 60 MB, and there are only 60 distinct keys in the shuffled records. !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/SpilledRecords.png|width=100%! *[Spill]* Since the in-memory size (3.6 GB) has reached the spill threshold, Spark spills the 60 records onto the disk. Since _60 < serializerBatchSize_ (default 10,000), all 60 records are serialized into the SerializeBuffer and then written onto the disk as a file segment. The 60 serialized records are about 581 MB (this is an estimated size; the real size may be larger).
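The interaction between the record count and the batch size can be sketched in a few lines of self-contained Scala. This is only an illustration of the batching behavior described above, not Spark's actual ExternalAppendOnlyMap code:

```scala
import java.io.ByteArrayOutputStream
import scala.collection.mutable.ArrayBuffer

// Records accumulate in a serialization buffer and are flushed to disk only
// every `batchSize` records. With 60 huge records and batchSize = 10000, the
// batch boundary is never reached, so all 60 records sit in one in-memory
// serialized batch before the final flush.
def spill(records: Iterator[Array[Byte]], batchSize: Int): Seq[Int] = {
  val flushedSizes = ArrayBuffer[Int]()
  val buffer = new ByteArrayOutputStream() // stand-in for the SerializeBuffer
  var inBatch = 0
  for (rec <- records) {
    buffer.write(rec) // stand-in for serializer.writeObject(rec)
    inBatch += 1
    if (inBatch == batchSize) { flushedSizes += buffer.size(); buffer.reset(); inBatch = 0 }
  }
  if (inBatch > 0) flushedSizes += buffer.size() // final partial batch
  flushedSizes.toSeq
}

// 60 records below the 10,000-record boundary all land in a single batch.
val batches = spill(Iterator.fill(60)(new Array[Byte](1024)), batchSize = 10000)
```

With each real record around 60 MB, that single batch is the multi-GB buffer the report measures, which is why a smaller batch size (or a size-based flush) would bound the serializer's memory footprint.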
[jira] [Updated] (SPARK-22675) Refactoring PropagateTypes in TypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-22675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22675: Description: PropagateTypes is called at the beginning of TypeCoercion and again at the end. Instead, we should call it in each rule that can change data types, so that the type changes are propagated up to the parents. (was: PropagateTypes are called twice in TypeCoercion. We do not need to call it twice. Instead, we should call it after each change on the types. ) > Refactoring PropagateTypes in TypeCoercion > -- > > Key: SPARK-22675 > URL: https://issues.apache.org/jira/browse/SPARK-22675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > PropagateTypes is called at the beginning of TypeCoercion and again at the end. Instead, we should call it in each rule that can change data types, so that the type changes are propagated up to the parents.