[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1802: --- Description: I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. was: I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. 
{code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar
[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1802: --- Description: I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. was: I'd like to have binaries release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. 
{code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar
[jira] [Created] (SPARK-1778) Add 'limit' transformation to SchemaRDD.
Takuya Ueshin created SPARK-1778: Summary: Add 'limit' transformation to SchemaRDD. Key: SPARK-1778 URL: https://issues.apache.org/jira/browse/SPARK-1778 Project: Spark Issue Type: Improvement Reporter: Takuya Ueshin Add {{limit}} transformation to {{SchemaRDD}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
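A usage sketch of what such a transformation could look like, assuming it mirrors the SchemaRDD operators of that era (createSchemaRDD, registerAsTable); the limit call itself is the hypothetical part:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object LimitSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "limit-sketch")
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    val people = sc.parallelize(Seq(Person("a", 1), Person("b", 2), Person("c", 3)))
    people.registerAsTable("people")

    // Hypothetical: a limit that stays in the SchemaRDD world, so it can be
    // composed with further relational operators, unlike take(n), which
    // returns rows to the driver.
    val firstTwo = sqlContext.sql("SELECT * FROM people").limit(2)
    firstTwo.collect().foreach(println)

    sc.stop()
  }
}
{code}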
[jira] [Commented] (SPARK-1787) Build failure on JDK8 :: SBT fails to load build configuration file
[ https://issues.apache.org/jira/browse/SPARK-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994244#comment-13994244 ] Richard Gomes commented on SPARK-1787: -- If I switch to JDK7, keeping everything else unchanged, SBT is able to load the build file. (j7s10)rgomes@terra:~/workspace/spark-0.9.1$ java -version java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) (j7s10)rgomes@terra:~/workspace/spark-0.9.1$ scala -version Scala code runner version 2.10.3 -- Copyright 2002-2013, LAMP/EPFL (j7s10)rgomes@terra:~/workspace/spark-0.9.1$ sbt/sbt clean Launching sbt from sbt/sbt-launch-0.12.4.jar [info] Loading project definition from /home/rgomes/workspace/spark-0.9.1/project/project [info] Compiling 1 Scala source to /home/rgomes/workspace/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes... [info] Loading project definition from /home/rgomes/workspace/spark-0.9.1/project [info] Set current project to root (in build file:/home/rgomes/workspace/spark-0.9.1/) [success] Total time: 0 s, completed 10-May-2014 15:40:26 Build failure on JDK8 :: SBT fails to load build configuration file --- Key: SPARK-1787 URL: https://issues.apache.org/jira/browse/SPARK-1787 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 0.9.0 Environment: JDK8 Scala 2.10.X SBT 0.12.X Reporter: Richard Gomes Priority: Minor SBT fails to build under JDK8. Please find steps to reproduce the error below: (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ uname -a Linux terra 3.13-1-amd64 #1 SMP Debian 3.13.10-1 (2014-04-15) x86_64 GNU/Linux (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ java -version java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ scala -version Scala code runner version 2.10.3 -- Copyright 2002-2013, LAMP/EPFL (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ sbt/sbt clean Launching sbt from sbt/sbt-launch-0.12.4.jar Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=350m; support was removed in 8.0 [info] Loading project definition from /home/rgomes/workspace/spark-0.9.1/project/project [info] Compiling 1 Scala source to /home/rgomes/workspace/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes... [error] error while loading CharSequence, class file '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken [error] (bad constant pool tag 15 at byte 1501) [error] error while loading Comparator, class file '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/util/Comparator.class)' is broken [error] (bad constant pool tag 15 at byte 5003) [error] two errors found [error] (compile:compile) Compilation failed Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1760) mvn -Dsuites=* test throw an ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992873#comment-13992873 ] Guoqiang Li commented on SPARK-1760: Hi, [~srowen] Is there a perfect solution? The [ building-with-maven.md|https://github.com/apache/spark/blob/master/docs/building-with-maven.md] has been updated mvn -Dsuites=* test throw an ClassNotFoundException -- Key: SPARK-1760 URL: https://issues.apache.org/jira/browse/SPARK-1760 Project: Spark Issue Type: Bug Reporter: Guoqiang Li {{mvn -Dhadoop.version=0.23.9 -Phadoop-0.23 -Dsuites=org.apache.spark.repl.ReplSuite test}} = {code} *** RUN ABORTED *** java.lang.ClassNotFoundException: org.apache.spark.repl.ReplSuite at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1470) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1469) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.immutable.List.foreach(List.scala:318) ... {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1781) Generalized validity checking for configuration parameters
[ https://issues.apache.org/jira/browse/SPARK-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993893#comment-13993893 ] William Benton commented on SPARK-1781: --- Could someone assign this issue to me? Generalized validity checking for configuration parameters -- Key: SPARK-1781 URL: https://issues.apache.org/jira/browse/SPARK-1781 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor Issues like SPARK-1779 could be handled easily by a general mechanism for specifying whether or not a configuration parameter value is valid or not (and then excepting or warning and switching to a default value if it is not). I think it's possible to do this in a fairly lightweight fashion. -- This message was sent by Atlassian JIRA (v6.2#6252)
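A minimal sketch of one lightweight way to do this, assuming nothing about the eventual design; validatedInt and its warn-and-fall-back behaviour are invented for illustration:
{code}
import org.apache.spark.SparkConf

// Sketch of a generic "validate or fall back to the default" helper for
// configuration parameters; SPARK-1779-style checks (e.g. a positive thread
// count) become one-liners at the point where the value is read.
object ConfValidation {
  def validatedInt(conf: SparkConf, key: String, default: Int)(valid: Int => Boolean): Int = {
    val value = conf.getInt(key, default)
    if (valid(value)) {
      value
    } else {
      // Warn and fall back rather than failing later in some unrelated place.
      System.err.println(s"Invalid value $value for $key, using default $default")
      default
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(false).set("spark.akka.threads", "-4")
    val threads = validatedInt(conf, "spark.akka.threads", 4)(_ > 0)
    println(s"Using $threads threads")  // prints the default, 4
  }
}
{code}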
[jira] [Created] (SPARK-1803) Rename test resources to be compatible with Windows FS
Stevo Slavic created SPARK-1803: --- Summary: Rename test resources to be compatible with Windows FS Key: SPARK-1803 URL: https://issues.apache.org/jira/browse/SPARK-1803 Project: Spark Issue Type: Task Components: Windows Affects Versions: 0.9.1 Reporter: Stevo Slavic Priority: Trivial {{git clone}} of master branch and then {{git status}} on Windows reports untracked files: {noformat} # Untracked files: # (use git add file... to include in what will be committed) # # sql/hive/src/test/resources/golden/Column pruning # sql/hive/src/test/resources/golden/Partition pruning # sql/hive/src/test/resources/golden/Partiton pruning {noformat} Actual issue is that several files under {{sql/hive/src/test/resources/golden}} directory have colon in name which is invalid character in file name on Windows. Please have these files renamed to a Windows compatible file name. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994990#comment-13994990 ] Sean Owen commented on SPARK-1802: -- [~pwendell] You can see my start on it here: https://github.com/srowen/spark/commits/SPARK-1802 https://github.com/srowen/spark/commit/a856604cfc67cb58146ada01fda6dbbb2515fa00 This resolves the new issues you note in your diff. Next issue is that hive-exec, quite awfully, includes a copy of all of its transitive dependencies in its artifact. See https://issues.apache.org/jira/browse/HIVE-5733 and note the warnings you'll get during assembly: {code} [WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: [WARNING] - org.apache.thrift.transport.TSaslTransport$SaslResponse ... {code} hive-exec is in fact used in this module. Aside from actual surgery on the artifact with the shade plugin, you can't control the dependencies as a result. This may be simply the best that can be done right now. If it has worked, it has worked. Am I right that the datanucleus JARs *are* meant to be in the assembly, only for the Hive build? https://github.com/apache/spark/pull/688 https://github.com/apache/spark/pull/610 That's good if so since that's what your diff shows. Finally, while we're here, I note that there are still a few JAR conflicts that turn up when you build the assembly *without* Hive. (I'm going to ignore conflicts in examples; these can be cleaned up but aren't really a big deal given its nature.) We could touch those up too. This is in the normal build (and I know how to zap most of this problem): {code} [WARNING] commons-beanutils-core-1.8.0.jar, commons-beanutils-1.7.0.jar define 82 overlappping classes: {code} These turn up in the Hadoop 2.x + YARN build: {code} [WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 overlappping classes: ... [WARNING] jcl-over-slf4j-1.7.5.jar, commons-logging-1.1.3.jar define 6 overlappping classes: ... [WARNING] activation-1.1.jar, javax.activation-1.1.0.v201105071233.jar define 17 overlappping classes: ... [WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 overlappping classes: {code} These should be easy to track down. Shall I? Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. 
{code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1770) repartition and coalesce(shuffle=true) put objects with the same key in the same bucket
[ https://issues.apache.org/jira/browse/SPARK-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993494#comment-13993494 ] Sandeep Singh commented on SPARK-1770: -- I think this is fixed in PR https://github.com/apache/spark/pull/704 by [~pwendell] repartition and coalesce(shuffle=true) put objects with the same key in the same bucket --- Key: SPARK-1770 URL: https://issues.apache.org/jira/browse/SPARK-1770 Project: Spark Issue Type: Bug Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Matei Zaharia Priority: Blocker Labels: Starter Fix For: 1.0.0 This is bad when you have many identical objects. We should assign each one a random key. -- This message was sent by Atlassian JIRA (v6.2#6252)
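For context, a sketch of the idea in the description ("assign each one a random key"), written against the public RDD API rather than the actual change in PR 704:
{code}
import scala.util.Random
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._

object RandomKeyRepartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "random-key-repartition")

    // 10,000 identical objects: hashing the values themselves would send them
    // all to the same bucket, which is the skew this issue describes.
    val data = sc.parallelize(Seq.fill(10000)("same-value"), 2)

    // Tag each element with a random key before the shuffle, then drop the
    // key, so identical values spread evenly across the output partitions.
    val spread = data
      .mapPartitions { iter =>
        val rand = new Random()
        iter.map(x => (rand.nextInt(), x))
      }
      .partitionBy(new HashPartitioner(8))
      .values

    println(spread.mapPartitions(it => Iterator(it.size)).collect().mkString(", "))
    sc.stop()
  }
}
{code}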
[jira] [Updated] (SPARK-1797) streaming on HDFS can detect all new files, but the sum of all the rdd.count() values does not equal the number detected
[ https://issues.apache.org/jira/browse/SPARK-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] QingFeng Zhang updated SPARK-1797: -- Description: When I put 200 png files to HDFS, I found Spark Streaming could detect 200 files, but the sum of rdd.count() is less than 200, always between 130 and 170; I don't know why... Is this a bug? PS: When I put 200 files in HDFS before the streaming job runs, it gets the correct count and right result. was: When I put 200 png files to HDFS, I found Spark Streaming could detect 200 files, but the sum of rdd.count() is less than 200, always between 130 and 170; I don't know why... Is this a bug? PS: When I put 200 files in HDFS before the streaming job runs, it gets the correct count and right result. def main(args: Array[String]) { val conf = new SparkConf().setMaster(SparkURL) .setAppName("QimageStreaming-broadcast") .setSparkHome(System.getenv("SPARK_HOME")) .setJars(SparkContext.jarOfClass(this.getClass())) conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") conf.set("spark.kryo.registrator", "qing.hdu.Image.MyRegistrator") conf.set("spark.kryoserializer.buffer.mb", "10"); val ssc = new StreamingContext(conf, Seconds(2)) val inputFormatClass = classOf[QimageInputFormat[Text, Qimage]] val outputFormatClass = classOf[QimageOutputFormat[Text, Qimage]] val input_path = HdfsURL + "/Qimage/input" val output_path = HdfsURL + "/Qimage/output/" val bg_path = HdfsURL + "/Qimage/bg/" val bg = ssc.sparkContext.newAPIHadoopFile[Text, Qimage, QimageInputFormat[Text, Qimage]](bg_path) val bbg = bg.map(data => (data._1.toString(), data._2)) val broadcastbg = ssc.sparkContext.broadcast(bbg) val file = ssc.fileStream[Text, Qimage, QimageInputFormat[Text, Qimage]](input_path) val qingbg = broadcastbg.value.collectAsMap val foreachFunc = (rdd: RDD[(Text, Qimage)], time: Time) => { val rddnum = rdd.count System.out.println("\n\n" + "rddnum is " + rddnum + "\n\n") if (rddnum > 0) { System.out.println("here is foreachFunc") val a = rdd.keys val b = a.first val cbg = qingbg.get(getbgID(b)).getOrElse(new Qimage) rdd.map(data => (data._1, (new QimageProc(data._1, data._2)).koutu(cbg))) .saveAsNewAPIHadoopFile(output_path, classOf[Text], classOf[Qimage], outputFormatClass) } } file.foreachRDD(foreachFunc) ssc.start() ssc.awaitTermination() } streaming on HDFS can detect all new files, but the sum of all the rdd.count() values does not equal the number detected - Key: SPARK-1797 URL: https://issues.apache.org/jira/browse/SPARK-1797 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 0.9.0 Environment: spark0.9.0,hadoop2.3.0,1 Master,5 Slaves. Reporter: QingFeng Zhang Attachments: 1.png When I put 200 png files to HDFS, I found Spark Streaming could detect 200 files, but the sum of rdd.count() is less than 200, always between 130 and 170; I don't know why... Is this a bug? PS: When I put 200 files in HDFS before the streaming job runs, it gets the correct count and right result. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1797) streaming on HDFS can detect all new files, but the sum of all the rdd.count() values does not equal the number detected
[ https://issues.apache.org/jira/browse/SPARK-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] QingFeng Zhang reopened SPARK-1797: --- streaming on HDFS can detect all new files, but the sum of all the rdd.count() values does not equal the number detected - Key: SPARK-1797 URL: https://issues.apache.org/jira/browse/SPARK-1797 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 0.9.0 Environment: spark0.9.0,hadoop2.3.0,1 Master,5 Slaves. Reporter: QingFeng Zhang Attachments: 1.png When I put 200 png files to HDFS, I found Spark Streaming could detect 200 files, but the sum of rdd.count() is less than 200, always between 130 and 170; I don't know why... Is this a bug? PS: When I put 200 files in HDFS before the streaming job runs, it gets the correct count and right result. def main(args: Array[String]) { val conf = new SparkConf().setMaster(SparkURL) .setAppName("QimageStreaming-broadcast") .setSparkHome(System.getenv("SPARK_HOME")) .setJars(SparkContext.jarOfClass(this.getClass())) conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") conf.set("spark.kryo.registrator", "qing.hdu.Image.MyRegistrator") conf.set("spark.kryoserializer.buffer.mb", "10"); val ssc = new StreamingContext(conf, Seconds(2)) val inputFormatClass = classOf[QimageInputFormat[Text, Qimage]] val outputFormatClass = classOf[QimageOutputFormat[Text, Qimage]] val input_path = HdfsURL + "/Qimage/input" val output_path = HdfsURL + "/Qimage/output/" val bg_path = HdfsURL + "/Qimage/bg/" val bg = ssc.sparkContext.newAPIHadoopFile[Text, Qimage, QimageInputFormat[Text, Qimage]](bg_path) val bbg = bg.map(data => (data._1.toString(), data._2)) val broadcastbg = ssc.sparkContext.broadcast(bbg) val file = ssc.fileStream[Text, Qimage, QimageInputFormat[Text, Qimage]](input_path) val qingbg = broadcastbg.value.collectAsMap val foreachFunc = (rdd: RDD[(Text, Qimage)], time: Time) => { val rddnum = rdd.count System.out.println("\n\n" + "rddnum is " + rddnum + "\n\n") if (rddnum > 0) { System.out.println("here is foreachFunc") val a = rdd.keys val b = a.first val cbg = qingbg.get(getbgID(b)).getOrElse(new Qimage) rdd.map(data => (data._1, (new QimageProc(data._1, data._2)).koutu(cbg))) .saveAsNewAPIHadoopFile(output_path, classOf[Text], classOf[Qimage], outputFormatClass) } } file.foreachRDD(foreachFunc) ssc.start() ssc.awaitTermination() } -- This message was sent by Atlassian JIRA (v6.2#6252)
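A minimal, self-contained variant of the reporter's program (plain text files instead of the custom Qimage input format) that tallies rdd.count() across batches; it assumes only the public streaming API and can be used to reproduce the mismatch described above:
{code}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CountNewFiles {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("count-new-files")
    val ssc = new StreamingContext(conf, Seconds(2))
    val total = new AtomicLong(0)

    // textFileStream watches args(0) for files that appear after the stream
    // starts; summing rdd.count() per batch should converge on the number of
    // records written, which is the invariant this report says is violated.
    ssc.textFileStream(args(0)).foreachRDD { rdd =>
      val n = rdd.count()
      println(s"batch count = $n, running total = ${total.addAndGet(n)}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}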
[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError
[ https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992920#comment-13992920 ] sri commented on SPARK-1394: We also bumping into the same issue. My I know, how and where can we comment the signal binding in pyspark? calling system.platform on worker raises IOError Key: SPARK-1394 URL: https://issues.apache.org/jira/browse/SPARK-1394 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0 Environment: Tested on Ubuntu and Linux, local and remote master, python 2.7.* Reporter: Idan Zalzberg Labels: pyspark A simple program that calls system.platform() on the worker fails most of the time (it works some times but very rarely). This is critical since many libraries call that method (e.g. boto). Here is the trace of the attempt to call that method: $ /usr/local/spark/bin/pyspark Python 2.7.3 (default, Feb 27 2014, 20:00:17) [GCC 4.6.3] on linux2 Type help, copyright, credits or license for more information. 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1) 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started 14/04/02 18:18:38 INFO Remoting: Starting remoting 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@10.33.102.46:36640] 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@10.33.102.46:36640] 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140402181839-919f 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 MB. 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id = ConnectionManagerId(10.33.102.46,43357) 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 10.33.102.46:43357 with 294.6 MB RAM 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at http://10.33.102.46:51803 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at http://10.33.102.46:4040 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 0.9.0 /_/ Using Python version 2.7.3 (default, Feb 27 2014 20:00:17) Spark context available as sc. 
import platform sc.parallelize([1]).map(lambda x : platform.system()).collect() 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at stdin:1 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at stdin:1) with 1 output partitions (allowLocal=false) 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at stdin:1) 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List() 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List() 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at collect at stdin:1), which has no missing parents 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[1] at collect at stdin:1) 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL) 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 12 ms 14/04/02 18:19:17 INFO Executor: Running task ID 0 PySpark worker failed with exception: Traceback (most recent call last): File /usr/local/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /usr/local/spark/python/pyspark/serializers.py, line 182, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /usr/local/spark/python/pyspark/serializers.py, line 117, in dump_stream for obj in iterator: File
[jira] [Updated] (SPARK-1795) Add recursive directory file search to fileInputStream
[ https://issues.apache.org/jira/browse/SPARK-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick OToole updated SPARK-1795: --- Description: When writing logs, they are often partitioned into a hierarchical directory structure. This change will allow spark streaming to monitor all sub-directories of a parent directory to find new files as they are added. See https://github.com/apache/spark/pull/537 was:When writing logs, they are often partitioned into a hierarchical directory structure. This change will allow spark streaming to monitor all sub-directories of a parent directory to find new files as they are added. Priority: Major (was: Minor) Add recursive directory file search to fileInputStream -- Key: SPARK-1795 URL: https://issues.apache.org/jira/browse/SPARK-1795 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Rick OToole When writing logs, they are often partitioned into a hierarchical directory structure. This change will allow spark streaming to monitor all sub-directories of a parent directory to find new files as they are added. See https://github.com/apache/spark/pull/537 -- This message was sent by Atlassian JIRA (v6.2#6252)
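For reference, a sketch of the directory walk such a feature needs, using the Hadoop FileSystem API directly; this is not the implementation in the pull request, and it assumes a Hadoop 2.x client where FileStatus.isDirectory is available:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object RecursiveListing {
  // Depth-first walk that returns every plain file under `root`, so a file
  // stream could consider new files in nested log directories rather than
  // only the top level.
  def listFilesRecursively(fs: FileSystem, root: Path): Seq[FileStatus] = {
    fs.listStatus(root).toSeq.flatMap { status =>
      if (status.isDirectory) listFilesRecursively(fs, status.getPath)
      else Seq(status)
    }
  }

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    listFilesRecursively(fs, new Path(args(0))).foreach(s => println(s.getPath))
  }
}
{code}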
[jira] [Created] (SPARK-1805) Error launching cluster when master and slave machines are of different virtualization types
Han JU created SPARK-1805: - Summary: Error launching cluster when master and slave machines are of different virtualization types Key: SPARK-1805 URL: https://issues.apache.org/jira/browse/SPARK-1805 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Han JU Priority: Minor In the current EC2 script, the AMI image object is loaded only once. This is fine when the master and slave machines are of the same virtualization type (pvm or hvm), but it won't work if, say, the master is pvm and the slaves are hvm, since an AMI is not compatible across these two kinds of virtualization. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1803) Rename test resources to be compatible with Windows FS
[ https://issues.apache.org/jira/browse/SPARK-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995152#comment-13995152 ] Stevo Slavic commented on SPARK-1803: - Created pull request with fix for this issue (see [here|https://github.com/apache/spark/pull/739]). Rename test resources to be compatible with Windows FS -- Key: SPARK-1803 URL: https://issues.apache.org/jira/browse/SPARK-1803 Project: Spark Issue Type: Task Components: Windows Affects Versions: 0.9.1 Reporter: Stevo Slavic Priority: Trivial {{git clone}} of master branch and then {{git status}} on Windows reports untracked files: {noformat} # Untracked files: # (use git add file... to include in what will be committed) # # sql/hive/src/test/resources/golden/Column pruning # sql/hive/src/test/resources/golden/Partition pruning # sql/hive/src/test/resources/golden/Partiton pruning {noformat} Actual issue is that several files under {{sql/hive/src/test/resources/golden}} directory have colon in name which is invalid character in file name on Windows. Please have these files renamed to a Windows compatible file name. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1763) SparkSubmit arguments do not propagate to python files on YARN
Andrew Or created SPARK-1763: Summary: SparkSubmit arguments do not propagate to python files on YARN Key: SPARK-1763 URL: https://issues.apache.org/jira/browse/SPARK-1763 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, YARN Affects Versions: 0.9.1 Reporter: Andrew Or Priority: Blocker Fix For: 1.0.0 The Python SparkConf, when loading defaults, does not pick up the system properties set by SparkSubmit. -- This message was sent by Atlassian JIRA (v6.2#6252)
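For comparison, a sketch of the JVM-side behaviour the Python side is missing: a default-loading SparkConf picks up any spark.* system property, which is how SparkSubmit hands settings to the driver.
{code}
import org.apache.spark.SparkConf

object ConfDefaults {
  def main(args: Array[String]): Unit = {
    // Simulate what spark-submit does: settings arrive as spark.* system
    // properties before the SparkConf is constructed.
    System.setProperty("spark.app.name", "my-yarn-app")
    System.setProperty("spark.executor.memory", "2g")

    // loadDefaults = true (the default) copies every spark.* system property
    // into the conf; the Python SparkConf described above does not do this.
    val conf = new SparkConf(loadDefaults = true)
    println(conf.get("spark.app.name"))        // my-yarn-app
    println(conf.get("spark.executor.memory")) // 2g
  }
}
{code}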
[jira] [Updated] (SPARK-1755) Spark-submit --name does not resolve to application name on YARN
[ https://issues.apache.org/jira/browse/SPARK-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1755: - Fix Version/s: (was: 1.0.1) 1.0.0 Spark-submit --name does not resolve to application name on YARN Key: SPARK-1755 URL: https://issues.apache.org/jira/browse/SPARK-1755 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Reporter: Andrew Or Fix For: 1.0.0 In YARN client mode, --name is ignored because the deploy mode is client, and the name is for some reason a [cluster config|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L170)]. In YARN cluster mode, --name is passed to the org.apache.spark.deploy.yarn.Client as a command line argument. The Client class, however, uses this name only as the [app name for the RM|https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L80], but not for Spark. In other words, when SparkConf attempts to load default configs, application name is not set. In both cases, passing --name to SparkSubmit does not actually cause Spark to adopt it as its application name, despite what the usage promises. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bouke van der Bijl updated SPARK-1764: -- Priority: Blocker (was: Critical) EOF reached before Python server acknowledged - Key: SPARK-1764 URL: https://issues.apache.org/jira/browse/SPARK-1764 Project: Spark Issue Type: Bug Components: Mesos, PySpark Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Priority: Blocker Labels: mesos, pyspark I'm getting EOF reached before Python server acknowledged while using PySpark on Mesos. The error manifests itself in multiple ways. One is: 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext And the other has a full stacktrace: 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This error causes the SparkContext to shutdown. I have not been able to reliably reproduce this bug, it seems to happen randomly, but if you run enough tasks on a SparkContext it'll hapen eventually -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992994#comment-13992994 ] Bouke van der Bijl edited comment on SPARK-1764 at 5/8/14 6:31 PM: --- I can semi-reliably recreate this by just running this code: {{quote}} while True: sc.parallelize(range(100)).map(lambda n: n * 2).collect() {{quote}} Running this on Mesos will eventually crash with Py4JJavaError: An error occurred while calling o1142.collect. : org.apache.spark.SparkException: Job 101 cancelled as part of cancellation of all jobs at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1151) at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1147) at akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295) at akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253) at akka.actor.ActorCell.handleFailure(ActorCell.scala:338) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262) at akka.dispatch.Mailbox.run(Mailbox.scala:218) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) I0508 18:29:03.623627 7868 sched.cpp:730] Stopping framework '20140508-173240-16842879-5050-24645-0032' 14/05/08 18:29:04 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at 
org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) was (Author: bouk): I can semi-reliably recreate this by just running
[jira] [Created] (SPARK-1806) Upgrade to Mesos 0.18.1 with Shaded Protobuf
Bernardo Gomez Palacio created SPARK-1806: - Summary: Upgrade to Mesos 0.18.1 with Shaded Protobuf Key: SPARK-1806 URL: https://issues.apache.org/jira/browse/SPARK-1806 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.1.0, 1.0.1 Reporter: Bernardo Gomez Palacio Upgrade Spark to depend on Mesos 0.18.1 with shaded protobuf. This version of Mesos does not externalize its dependency on the protobuf version (now shaded through the namespace org.apache.mesos.protobuf) and therefore facilitates integration with systems that do depend on specific versions of protobufs such as Hadoop 1.0.x, 2.x, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
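For illustration, roughly what depending on the shaded artifact looks like from an sbt build, assuming the shaded-protobuf classifier published by the Mesos project (the actual Spark change lives in the project's own build files):
{code}
// build.sbt fragment (sketch): depend on the Mesos artifact whose protobuf
// classes are shaded under org.apache.mesos.protobuf, so the application's
// own protobuf version (e.g. the one Hadoop needs) is left untouched.
libraryDependencies += "org.apache.mesos" % "mesos" % "0.18.1" classifier "shaded-protobuf"
{code}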
[jira] [Resolved] (SPARK-1772) Spark executors do not successfully die on OOM
[ https://issues.apache.org/jira/browse/SPARK-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1772. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 715 [https://github.com/apache/spark/pull/715] Spark executors do not successfully die on OOM -- Key: SPARK-1772 URL: https://issues.apache.org/jira/browse/SPARK-1772 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson Fix For: 1.0.0 Executor catches Throwable, and does not always die when JVM fatal exceptions occur. This is a problem because any subsequent use of these Executors are very likely to fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
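A sketch of the general pattern the fix is after, not the actual patch: treat JVM-fatal throwables differently from ordinary task failures, so the process exits instead of limping on.
{code}
import scala.util.control.NonFatal

object FatalErrorHandling {
  // Run one task's worth of work: report ordinary failures, but let the
  // process exit on fatal errors such as OutOfMemoryError, since an executor
  // that has already hit one is unlikely to run anything else successfully.
  def runTask(work: () => Unit): Unit = {
    try {
      work()
    } catch {
      case NonFatal(e) =>
        println(s"task failed, reporting to driver: $e")
      case fatal: Throwable =>
        System.err.println(s"fatal error, shutting down executor JVM: $fatal")
        System.exit(1)
    }
  }

  def main(args: Array[String]): Unit = {
    runTask(() => throw new RuntimeException("ordinary failure")) // reported
    runTask(() => throw new OutOfMemoryError("simulated OOM"))    // exits
  }
}
{code}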
[jira] [Commented] (SPARK-1806) Upgrade to Mesos 0.18.1 with Shaded Protobuf
[ https://issues.apache.org/jira/browse/SPARK-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995265#comment-13995265 ] Bernardo Gomez Palacio commented on SPARK-1806: --- Should close SPARK-1433 Upgrade to Mesos 0.18.1 with Shaded Protobuf Key: SPARK-1806 URL: https://issues.apache.org/jira/browse/SPARK-1806 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.1.0, 1.0.1 Reporter: Bernardo Gomez Palacio Labels: mesos Upgrade Spark to depend on Mesos 0.18.1 with shaded protobuf. This version of Mesos does not externalize its dependency on the protobuf version (now shaded through the namespace org.apache.mesos.protobuf) and therefore facilitates integration with systems that do depend on specific versions of protobufs such as Hadoop 1.0.x, 2.x, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1749) DAGScheduler supervisor strategy broken with Mesos
[ https://issues.apache.org/jira/browse/SPARK-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993602#comment-13993602 ] Bouke van der Bijl commented on SPARK-1749: --- This isn't really PySpark specific, this works fine on other backends which will mark the task as failed and just keep the SparkContext running. It shouldn't be shutting down the whole SparkContext just because a single job failed DAGScheduler supervisor strategy broken with Mesos -- Key: SPARK-1749 URL: https://issues.apache.org/jira/browse/SPARK-1749 Project: Spark Issue Type: Bug Components: Mesos, Spark Core Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Assignee: Mark Hamstra Priority: Blocker Labels: mesos, scheduler, scheduling Any bad Python code will trigger this bug, for example `sc.parallelize(range(100)).map(lambda n: undefined_variable * 2).collect()` will cause a `undefined_variable isn't defined`, which will cause spark to try to kill the task, resulting in the following stacktrace: java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:184) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:182) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:175) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:175) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1058) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:499) at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1151) at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1147) at akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295) at 
akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253) at akka.actor.ActorCell.handleFailure(ActorCell.scala:338) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262) at akka.dispatch.Mailbox.run(Mailbox.scala:218) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This is because killTask isn't implemented for the MesosSchedulerBackend. I assume this isn't pyspark-specific, as there will be other
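A sketch of the missing piece, i.e. a killTask that forwards to the Mesos driver instead of inheriting the default that throws; the class and method shape here are illustrative, and only the SchedulerDriver.killTask call is the real Mesos API:
{code}
import org.apache.mesos.{Protos, SchedulerDriver}

// Sketch: forward a Spark task-kill request to Mesos rather than falling
// back to the default killTask, whose UnsupportedOperationException (as
// described above) ends up tearing down the whole SparkContext.
class MesosKillSupport(driver: SchedulerDriver) {
  def killTask(taskId: Long, interruptThread: Boolean): Unit = {
    val mesosTaskId = Protos.TaskID.newBuilder().setValue(taskId.toString).build()
    driver.killTask(mesosTaskId)
  }
}
{code}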
[jira] [Commented] (SPARK-1806) Upgrade to Mesos 0.18.1 with Shaded Protobuf
[ https://issues.apache.org/jira/browse/SPARK-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995382#comment-13995382 ] Bernardo Gomez Palacio commented on SPARK-1806: --- Thanks [~pwendell] for addressing this so quickly! Upgrade to Mesos 0.18.1 with Shaded Protobuf Key: SPARK-1806 URL: https://issues.apache.org/jira/browse/SPARK-1806 Project: Spark Issue Type: Dependency upgrade Components: Spark Core Affects Versions: 1.0.0, 1.1.0, 1.0.1 Reporter: Bernardo Gomez Palacio Labels: mesos Fix For: 1.0.0 Upgrade Spark to depend on Mesos 0.18.1 with shaded protobuf. This version of Mesos does not externalize its dependency on the protobuf version (now shaded through the namespace org.apache.mesos.protobuf) and therefore facilitates integration with systems that do depend on specific versions of protobufs such as Hadoop 1.0.x, 2.x, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1806) Upgrade to Mesos 0.18.1 with Shaded Protobuf
[ https://issues.apache.org/jira/browse/SPARK-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1806. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 741 [https://github.com/apache/spark/pull/741] Upgrade to Mesos 0.18.1 with Shaded Protobuf Key: SPARK-1806 URL: https://issues.apache.org/jira/browse/SPARK-1806 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.1.0, 1.0.1 Reporter: Bernardo Gomez Palacio Labels: mesos Fix For: 1.0.0 Upgrade Spark to depend on Mesos 0.18.1 with shaded protobuf. This version of Mesos does not externalize its dependency on the protobuf version (now shaded through the namespace org.apache.mesos.protobuf) and therefore facilitates integration with systems that do depend on specific versions of protobufs such as Hadoop 1.0.x, 2.x, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995421#comment-13995421 ] Bouke van der Bijl commented on SPARK-1764: --- I did some more digging into this and I have no idea what's the exact issue. The write to the Python server succeeds (which I checked from the Python side) but the Scala side doesn't seem to be able to read the acknowledgement. I have also confirmed that it isn't an issue with the Python broadcast server dying, as commented out the exception makes it work fine (!) EOF reached before Python server acknowledged - Key: SPARK-1764 URL: https://issues.apache.org/jira/browse/SPARK-1764 Project: Spark Issue Type: Bug Components: Mesos, PySpark Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Priority: Blocker Labels: mesos, pyspark I'm getting EOF reached before Python server acknowledged while using PySpark on Mesos. The error manifests itself in multiple ways. One is: 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext And the other has a full stacktrace: 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This error causes the SparkContext to shutdown. I have not been able to reliably reproduce this bug, it seems to happen randomly, but if you run enough tasks on a SparkContext it'll hapen eventually -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1802: - Attachment: hive-exec-jar-problems.txt Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 Attachments: hive-exec-jar-problems.txt I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1801) Open up some private APIs related to creating new RDDs for developers
[ https://issues.apache.org/jira/browse/SPARK-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-1801: Summary: Open up some private APIs related to creating new RDDs for developers (was: Open up sime private APIs related to creating new RDDs for developers) Open up some private APIs related to creating new RDDs for developers - Key: SPARK-1801 URL: https://issues.apache.org/jira/browse/SPARK-1801 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Priority: Minor In writing my own RDD I ran into a few issues with respect to stuff being private in Spark. In compute() I would like to return an iterator that respects task killing (as HadoopRDD does), but the mechanics for that are inside the private InterruptibleIterator. Also, the exception I am supposed to throw (TaskKilledException) is private to Spark. See also: http://apache-spark-user-list.1001560.n3.nabble.com/Re-writing-my-own-RDD-td5558.html -- This message was sent by Atlassian JIRA (v6.2#6252)
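A minimal sketch of the compute() the reporter describes; note that InterruptibleIterator and TaskKilledException are private[spark] at the time of this issue, so code like this only compiles from inside the org.apache.spark namespace, which is exactly the complaint.
{code}
import org.apache.spark.{InterruptibleIterator, Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative custom RDD (hypothetical): wraps its iterator so the running task
// responds to kill requests, the same way HadoopRDD does.
class MyRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
  override def getPartitions: Array[Partition] =
    Array(new Partition { override def index: Int = 0 })

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val underlying = Iterator.range(0, 1000000)
    // Wrap the raw iterator in the (currently private) InterruptibleIterator.
    new InterruptibleIterator(context, underlying)
  }
}
{code}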
[jira] [Resolved] (SPARK-1736) spark-submit on Windows
[ https://issues.apache.org/jira/browse/SPARK-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1736. Resolution: Fixed spark-submit on Windows --- Key: SPARK-1736 URL: https://issues.apache.org/jira/browse/SPARK-1736 Project: Spark Issue Type: Improvement Components: Windows Reporter: Matei Zaharia Assignee: Andrew Or Priority: Blocker Fix For: 1.0.0 - spark-submit needs a Windows version (shouldn't be too hard, it's just launching a Java process) - spark-shell.cmd needs to run through spark-submit like it does on Unix -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1652) Fixes and improvements for spark-submit/configs
[ https://issues.apache.org/jira/browse/SPARK-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995892#comment-13995892 ] Patrick Wendell commented on SPARK-1652: The remaining issues here all have work-arounds in 1.0, so I'm bumping this to 1.1. Fixes and improvements for spark-submit/configs --- Key: SPARK-1652 URL: https://issues.apache.org/jira/browse/SPARK-1652 Project: Spark Issue Type: Bug Components: Spark Core, YARN Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.1.0 These are almost all a result of my config patch. Unfortunately the changes were difficult to unit-test and there were several edge cases reported. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1802: --- Assignee: Sean Owen Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1815) SparkContext's constructor that only takes a SparkConf shouldn't be a DeveloperApi
Sandy Ryza created SPARK-1815: - Summary: SparkContext's constructor that only takes a SparkConf shouldn't be a DeveloperApi Key: SPARK-1815 URL: https://issues.apache.org/jira/browse/SPARK-1815 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Fix For: 1.0.0 It's the constructor used in the examples. -- This message was sent by Atlassian JIRA (v6.2#6252)
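For reference, this is the constructor in question, as typically used in the examples:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Build a SparkConf and hand it straight to SparkContext; this single-argument
// constructor is the one the examples rely on.
val conf = new SparkConf().setAppName("Example").setMaster("local")
val sc = new SparkContext(conf)
{code}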
[jira] [Created] (SPARK-1811) Support resizable output buffer for kryo serializer
koert kuipers created SPARK-1811: Summary: Support resizable output buffer for kryo serializer Key: SPARK-1811 URL: https://issues.apache.org/jira/browse/SPARK-1811 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Priority: Minor Currently the size of the Kryo serializer output buffer can be set with spark.kryoserializer.buffer.mb. The issue with this setting is that it has to be one-size-fits-all, so it ends up being the maximum size needed, even if only a single task out of many needs it to be that big. A resizable buffer will allow most tasks to use a modest-sized buffer, while the occasional task that needs a really big buffer can get it at a cost (allocating a new buffer and copying the contents over repeatedly as the buffer grows... with each new allocation the size doubles). The class used for the buffer is Kryo's Output, which supports resizing if maxCapacity is set bigger than capacity. I suggest we provide a setting spark.kryoserializer.buffer.max.mb, which defaults to spark.kryoserializer.buffer.mb and sets Output's maxCapacity. Pull request for this jira: https://github.com/apache/spark/pull/735 -- This message was sent by Atlassian JIRA (v6.2#6252)
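A rough sketch of the proposal (the property name spark.kryoserializer.buffer.max.mb is the one suggested above; the defaults shown are illustrative):
{code}
import com.esotericsoftware.kryo.io.Output
import org.apache.spark.SparkConf

val conf = new SparkConf()
val bufferMb = conf.getInt("spark.kryoserializer.buffer.mb", 2)
// Proposed cap to which the buffer may grow; defaults to the initial size.
val maxBufferMb = conf.getInt("spark.kryoserializer.buffer.max.mb", bufferMb)
// Kryo's Output(bufferSize, maxBufferSize) doubles its buffer on demand up to maxBufferSize.
val output = new Output(bufferMb * 1024 * 1024, maxBufferMb * 1024 * 1024)
{code}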
[jira] [Updated] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf
[ https://issues.apache.org/jira/browse/SPARK-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1680: --- Description: We should make this consistent between YARN and Standalone. Basically, YARN mode should just use the executorEnvs from the Spark conf and not need SPARK_YARN_USER_ENV. (was: We should make this consistent between YARN and SparkConf. Basically, YARN mode should just use the executorEnvs from the Spark conf and not need SPARK_YARN_USER_ENV.) Clean up use of setExecutorEnvs in SparkConf - Key: SPARK-1680 URL: https://issues.apache.org/jira/browse/SPARK-1680 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 We should make this consistent between YARN and Standalone. Basically, YARN mode should just use the executorEnvs from the Spark conf and not need SPARK_YARN_USER_ENV. -- This message was sent by Atlassian JIRA (v6.2#6252)
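For illustration, this is the SparkConf route that YARN mode would honor instead of SPARK_YARN_USER_ENV (the variable name and value here are just examples):
{code}
import org.apache.spark.SparkConf

// Executor environment variables set through the conf rather than a YARN-specific env var.
val conf = new SparkConf()
  .setAppName("Example")
  .setExecutorEnv("LD_LIBRARY_PATH", "/opt/native/lib")
{code}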
[jira] [Resolved] (SPARK-1798) Tests should clean up temp files
[ https://issues.apache.org/jira/browse/SPARK-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1798. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 732 [https://github.com/apache/spark/pull/732] Tests should clean up temp files Key: SPARK-1798 URL: https://issues.apache.org/jira/browse/SPARK-1798 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Minor Fix For: 1.0.0 Three issues related to temp files that tests generate -- these should be touched up for hygiene but are not urgent. Modules have a log4j.properties which directs the unit-test.log output file to a directory like [module]/target/unit-test.log, but this ends up creating [module]/[module]/target/unit-test.log instead of the former. The work/ directory is not deleted by mvn clean, either in the parent or in the modules; neither is the checkpoint/ directory created under the various external modules. Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling deleteOnExit() at creation and trying to call Utils.deleteRecursively consistently to clean up, sometimes in an @After method. (If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of LocalSparkContext, which provides management of temp directories for subclasses to take advantage of.) -- This message was sent by Atlassian JIRA (v6.2#6252)
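A minimal sketch of the kind of trait suggested at the end of that report (names hypothetical, assuming ScalaTest):
{code}
import java.io.File
import java.nio.file.Files
import org.scalatest.{BeforeAndAfterEach, Suite}

// Hypothetical test mixin: each test gets a fresh temp directory that is registered for
// deletion on exit and removed again after the test runs.
trait TempDirectory extends BeforeAndAfterEach { self: Suite =>
  protected var tempDir: File = _

  override def beforeEach(): Unit = {
    super.beforeEach()
    tempDir = Files.createTempDirectory("spark-test").toFile
    tempDir.deleteOnExit()
  }

  override def afterEach(): Unit = {
    try deleteRecursively(tempDir) finally super.afterEach()
  }

  private def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
    f.delete()
  }
}
{code}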
[jira] [Resolved] (SPARK-1773) Standalone cluster docs should be updated to reflect Spark Submit
[ https://issues.apache.org/jira/browse/SPARK-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1773. Resolution: Fixed Standalone cluster docs should be updated to reflect Spark Submit - Key: SPARK-1773 URL: https://issues.apache.org/jira/browse/SPARK-1773 Project: Spark Issue Type: Bug Components: Documentation Reporter: Patrick Wendell Assignee: Andrew Or Priority: Blocker Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1771) CoarseGrainedSchedulerBackend is not resilient to Akka restarts
[ https://issues.apache.org/jira/browse/SPARK-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995962#comment-13995962 ] Nan Zhu commented on SPARK-1771: [#Aaron Davidson], I think there are basically two ways to fix this bug, depending on whether we want to allow restarting of the driver: 1. if we allow restarting, we may need something similar to the persistentEngine in the deploy package; 2. if not, we can introduce a supervisor actor to stop the DriverActor and kill the executors, similar to what we just did in the DAGScheduler. CoarseGrainedSchedulerBackend is not resilient to Akka restarts --- Key: SPARK-1771 URL: https://issues.apache.org/jira/browse/SPARK-1771 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson The exception reported in SPARK-1769 was propagated through the CoarseGrainedSchedulerBackend, and caused an Actor restart of the DriverActor. Unfortunately, this actor does not seem to have been written with Akka restartability in mind. For instance, the new DriverActor has lost all state about the prior Executors without cleanly disconnecting them. This means that the driver actually has executors attached to it, but doesn't think it does, which leads to mayhem of various sorts. -- This message was sent by Atlassian JIRA (v6.2#6252)
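A sketch of option 2 from the comment above (all names hypothetical): an explicit supervisor whose strategy stops the failing DriverActor instead of letting Akka restart it with its executor bookkeeping lost.
{code}
import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Stop

// Hypothetical supervisor: on any exception from the child, stop it (leaving executor
// cleanup/shutdown to the rest of the code path) rather than silently restarting it.
class DriverSupervisor(driverProps: Props) extends Actor {
  private val driver: ActorRef = context.actorOf(driverProps, "CoarseGrainedScheduler")

  override val supervisorStrategy = OneForOneStrategy() {
    case _: Exception => Stop
  }

  def receive = {
    case msg => driver.forward(msg)
  }
}
{code}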
[jira] [Updated] (SPARK-1652) Fixes and improvements for spark-submit/configs
[ https://issues.apache.org/jira/browse/SPARK-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1652: --- Fix Version/s: (was: 1.0.0) 1.1.0 Fixes and improvements for spark-submit/configs --- Key: SPARK-1652 URL: https://issues.apache.org/jira/browse/SPARK-1652 Project: Spark Issue Type: Bug Components: Spark Core, YARN Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.1.0 These are almost all a result of my config patch. Unfortunately the changes were difficult to unit-test and there were several edge cases reported. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1779) Warning when spark.storage.memoryFraction is not between 0 and 1
[ https://issues.apache.org/jira/browse/SPARK-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993615#comment-13993615 ] Guoqiang Li commented on SPARK-1779: I think it should throw an exception here. Warning when spark.storage.memoryFraction is not between 0 and 1 Key: SPARK-1779 URL: https://issues.apache.org/jira/browse/SPARK-1779 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: wangfei Fix For: 1.1.0 There should be a warning when memoryFraction is lower than 0 or greater than 1. -- This message was sent by Atlassian JIRA (v6.2#6252)
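If the consensus is to fail fast rather than warn, a hard check could look like the following sketch (where this check would live is an assumption; the 0.6 default matches the documented default for this setting):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
// Reject values outside (0, 1] up front instead of failing obscurely later.
require(memoryFraction > 0 && memoryFraction <= 1,
  s"spark.storage.memoryFraction must be in (0, 1], got $memoryFraction")
{code}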
[jira] [Reopened] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-1802: Let's keep this open given the ongoing discussion. Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 Attachments: hive-exec-jar-problems.txt I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1757) Support saving null primitives with .saveAsParquetFile()
[ https://issues.apache.org/jira/browse/SPARK-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash resolved SPARK-1757. --- Resolution: Fixed Fix Version/s: 1.0.0 https://github.com/apache/spark/pull/690 Support saving null primitives with .saveAsParquetFile() Key: SPARK-1757 URL: https://issues.apache.org/jira/browse/SPARK-1757 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Andrew Ash Fix For: 1.0.0 See stack trace below: {noformat} 14/05/07 21:45:51 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations 14/05/07 21:45:51 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences 14/05/07 21:45:51 INFO optimizer.Optimizer$: Max iterations (2) reached for batch ConstantFolding 14/05/07 21:45:51 INFO optimizer.Optimizer$: Max iterations (2) reached for batch Filter Pushdown java.lang.RuntimeException: Unsupported datatype StructType(List()) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetRelation.scala:201) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetRelation.scala:234) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetRelation.scala:267) at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:143) at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:122) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:139) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:265) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:265) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:268) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:268) at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:66) at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:96) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1814) Splash page should include correct syntax for launching examples
Patrick Wendell created SPARK-1814: -- Summary: Splash page should include correct syntax for launching examples Key: SPARK-1814 URL: https://issues.apache.org/jira/browse/SPARK-1814 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995632#comment-13995632 ] Bouke van der Bijl commented on SPARK-1764: --- Interesting, including SPARK-1806 in our build made it stop failing... I guess this can be considered fixed, then. EOF reached before Python server acknowledged - Key: SPARK-1764 URL: https://issues.apache.org/jira/browse/SPARK-1764 Project: Spark Issue Type: Bug Components: Mesos, PySpark Affects Versions: 1.0.0 Reporter: Bouke van der Bijl Priority: Blocker Labels: mesos, pyspark I'm getting EOF reached before Python server acknowledged while using PySpark on Mesos. The error manifests itself in multiple ways. One is: 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext And the other has a full stacktrace: 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged org.apache.spark.SparkException: EOF reached before Python server acknowledged at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416) at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387) at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279) at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.Accumulators$.add(Accumulators.scala:277) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This error causes the SparkContext to shut down. I have not been able to reliably reproduce this bug; it seems to happen randomly, but if you run enough tasks on a SparkContext it'll happen eventually. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1765) Modify a typo in monitoring.md
[ https://issues.apache.org/jira/browse/SPARK-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996012#comment-13996012 ] Andrew Ash commented on SPARK-1765: --- https://github.com/apache/spark/pull/698 This can now be closed. Modify a typo in monitoring.md -- Key: SPARK-1765 URL: https://issues.apache.org/jira/browse/SPARK-1765 Project: Spark Issue Type: Bug Reporter: Kousuke Saruta Priority: Minor There is a word 'JXM' in monitoring.md. I guess it's a typo for 'JMX'. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1816) LiveListenerBus dies if a listener throws an exception
Aaron Davidson created SPARK-1816: - Summary: LiveListenerBus dies if a listener throws an exception Key: SPARK-1816 URL: https://issues.apache.org/jira/browse/SPARK-1816 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Andrew Or Priority: Critical The exception isn't even printed. -- This message was sent by Atlassian JIRA (v6.2#6252)
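The eventual fix presumably needs to both survive and surface listener failures; a generic sketch of that kind of guard (not the actual LiveListenerBus code):
{code}
// Generic sketch: invoke each listener inside a try/catch so one misbehaving listener
// neither kills the bus thread nor fails silently.
def safePostToAll[L, E](listeners: Seq[L], event: E)(dispatch: (L, E) => Unit): Unit = {
  listeners.foreach { listener =>
    try {
      dispatch(listener, event)
    } catch {
      case e: Exception =>
        System.err.println(s"Listener ${listener.getClass.getName} threw an exception: $e")
    }
  }
}
{code}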