[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1802:
---

Description: 
I'd like to have the binary release for 1.0 include Hive support. Since this isn't 
enabled by default in the build, I don't think it's as well tested, so we should 
dig around a bit and decide whether we need to, e.g., add any excludes.

{code}
$ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > without_hive.txt

$ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > with_hive.txt

$ diff without_hive.txt with_hive.txt
 antlr-2.7.7.jar
 antlr-3.4.jar
 antlr-runtime-3.4.jar
10,14d6
 avro-1.7.4.jar
 avro-ipc-1.7.4.jar
 avro-ipc-1.7.4-tests.jar
 avro-mapred-1.7.4.jar
 bonecp-0.7.1.RELEASE.jar
22d13
 commons-cli-1.2.jar
25d15
 commons-compress-1.4.1.jar
33,34d22
 commons-logging-1.1.1.jar
 commons-logging-api-1.0.4.jar
38d25
 commons-pool-1.5.4.jar
46,49d32
 datanucleus-api-jdo-3.2.1.jar
 datanucleus-core-3.2.2.jar
 datanucleus-rdbms-3.2.1.jar
 derby-10.4.2.0.jar
53,57d35
 hive-common-0.12.0.jar
 hive-exec-0.12.0.jar
 hive-metastore-0.12.0.jar
 hive-serde-0.12.0.jar
 hive-shims-0.12.0.jar
60,61d37
 httpclient-4.1.3.jar
 httpcore-4.1.3.jar
68d43
 JavaEWAH-0.3.2.jar
73d47
 javolution-5.5.1.jar
76d49
 jdo-api-3.0.1.jar
78d50
 jetty-6.1.26.jar
87d58
 jetty-util-6.1.26.jar
93d63
 json-20090211.jar
98d67
 jta-1.1.jar
103,104d71
 libfb303-0.9.0.jar
 libthrift-0.9.0.jar
112d78
 mockito-all-1.8.5.jar
136d101
 servlet-api-2.5-20081211.jar
139d103
 snappy-0.2.jar
144d107
 spark-hive_2.10-1.0.0.jar
151d113
 ST4-4.0.4.jar
153d114
 stringtemplate-3.2.1.jar
156d116
 velocity-1.7.jar
158d117
 xz-1.0.jar
{code}

Some initial investigation suggests we may need to take some precaution 
surrounding (a) jetty and (b) servlet-api.

  was:
I'd like to have binary release for 1.0 include Hive support. Since this isn't 
enabled by default in the build I don't think it's as well tested, so we should 
dig around a bit and decide if we need to e.g. add any excludes.

{code}
$ mvn install -Phive -DskipTests && mvn dependency:build-classpath assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > without_hive.txt

$ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > with_hive.txt

$ diff without_hive.txt with_hive.txt
 antlr-2.7.7.jar
 antlr-3.4.jar
 antlr-runtime-3.4.jar
10,14d6
 avro-1.7.4.jar
 avro-ipc-1.7.4.jar
 avro-ipc-1.7.4-tests.jar
 avro-mapred-1.7.4.jar
 bonecp-0.7.1.RELEASE.jar
22d13
 commons-cli-1.2.jar
25d15
 commons-compress-1.4.1.jar
33,34d22
 commons-logging-1.1.1.jar
 commons-logging-api-1.0.4.jar
38d25
 commons-pool-1.5.4.jar
46,49d32
 datanucleus-api-jdo-3.2.1.jar
 datanucleus-core-3.2.2.jar
 datanucleus-rdbms-3.2.1.jar
 derby-10.4.2.0.jar
53,57d35
 hive-common-0.12.0.jar
 hive-exec-0.12.0.jar
 hive-metastore-0.12.0.jar
 hive-serde-0.12.0.jar
 hive-shims-0.12.0.jar
60,61d37
 httpclient-4.1.3.jar
 httpcore-4.1.3.jar
68d43
 JavaEWAH-0.3.2.jar
73d47
 javolution-5.5.1.jar
76d49
 jdo-api-3.0.1.jar
78d50
 jetty-6.1.26.jar
87d58
 jetty-util-6.1.26.jar
93d63
 json-20090211.jar
98d67
 jta-1.1.jar
103,104d71
 libfb303-0.9.0.jar
 libthrift-0.9.0.jar
112d78
 mockito-all-1.8.5.jar
136d101
 servlet-api-2.5-20081211.jar
139d103
 snappy-0.2.jar
144d107
 spark-hive_2.10-1.0.0.jar
151d113
 ST4-4.0.4.jar
153d114
 stringtemplate-3.2.1.jar
156d116
 velocity-1.7.jar
158d117
 xz-1.0.jar
{code}

Some initial investigation suggests we may need to take some precaution 
surrounding (a) jetty and (b) servlet-api.


 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 I'd like to have binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  

[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1802:
---

Description: 
I'd like to have the binary release for 1.0 include Hive support. Since this isn't 
enabled by default in the build, I don't think it's as well tested, so we should 
dig around a bit and decide whether we need to, e.g., add any excludes.

{code}
$ mvn install -Phive -DskipTests && mvn dependency:build-classpath assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > without_hive.txt

$ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > with_hive.txt

$ diff without_hive.txt with_hive.txt
 antlr-2.7.7.jar
 antlr-3.4.jar
 antlr-runtime-3.4.jar
10,14d6
 avro-1.7.4.jar
 avro-ipc-1.7.4.jar
 avro-ipc-1.7.4-tests.jar
 avro-mapred-1.7.4.jar
 bonecp-0.7.1.RELEASE.jar
22d13
 commons-cli-1.2.jar
25d15
 commons-compress-1.4.1.jar
33,34d22
 commons-logging-1.1.1.jar
 commons-logging-api-1.0.4.jar
38d25
 commons-pool-1.5.4.jar
46,49d32
 datanucleus-api-jdo-3.2.1.jar
 datanucleus-core-3.2.2.jar
 datanucleus-rdbms-3.2.1.jar
 derby-10.4.2.0.jar
53,57d35
 hive-common-0.12.0.jar
 hive-exec-0.12.0.jar
 hive-metastore-0.12.0.jar
 hive-serde-0.12.0.jar
 hive-shims-0.12.0.jar
60,61d37
 httpclient-4.1.3.jar
 httpcore-4.1.3.jar
68d43
 JavaEWAH-0.3.2.jar
73d47
 javolution-5.5.1.jar
76d49
 jdo-api-3.0.1.jar
78d50
 jetty-6.1.26.jar
87d58
 jetty-util-6.1.26.jar
93d63
 json-20090211.jar
98d67
 jta-1.1.jar
103,104d71
 libfb303-0.9.0.jar
 libthrift-0.9.0.jar
112d78
 mockito-all-1.8.5.jar
136d101
 servlet-api-2.5-20081211.jar
139d103
 snappy-0.2.jar
144d107
 spark-hive_2.10-1.0.0.jar
151d113
 ST4-4.0.4.jar
153d114
 stringtemplate-3.2.1.jar
156d116
 velocity-1.7.jar
158d117
 xz-1.0.jar
{code}

Some initial investigation suggests we may need to take some precaution 
surrounding (a) jetty and (b) servlet-api.

  was:
I'd like to have binaries release for 1.0 include Hive support. Since this 
isn't enabled by default in the build I don't think it's as well tested, so we 
should dig around a bit and decide if we need to e.g. add any excludes.

{code}
$ mvn install -Phive -DskipTests && mvn dependency:build-classpath assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > without_hive.txt

$ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > with_hive.txt

$ diff without_hive.txt with_hive.txt
 antlr-2.7.7.jar
 antlr-3.4.jar
 antlr-runtime-3.4.jar
10,14d6
 avro-1.7.4.jar
 avro-ipc-1.7.4.jar
 avro-ipc-1.7.4-tests.jar
 avro-mapred-1.7.4.jar
 bonecp-0.7.1.RELEASE.jar
22d13
 commons-cli-1.2.jar
25d15
 commons-compress-1.4.1.jar
33,34d22
 commons-logging-1.1.1.jar
 commons-logging-api-1.0.4.jar
38d25
 commons-pool-1.5.4.jar
46,49d32
 datanucleus-api-jdo-3.2.1.jar
 datanucleus-core-3.2.2.jar
 datanucleus-rdbms-3.2.1.jar
 derby-10.4.2.0.jar
53,57d35
 hive-common-0.12.0.jar
 hive-exec-0.12.0.jar
 hive-metastore-0.12.0.jar
 hive-serde-0.12.0.jar
 hive-shims-0.12.0.jar
60,61d37
 httpclient-4.1.3.jar
 httpcore-4.1.3.jar
68d43
 JavaEWAH-0.3.2.jar
73d47
 javolution-5.5.1.jar
76d49
 jdo-api-3.0.1.jar
78d50
 jetty-6.1.26.jar
87d58
 jetty-util-6.1.26.jar
93d63
 json-20090211.jar
98d67
 jta-1.1.jar
103,104d71
 libfb303-0.9.0.jar
 libthrift-0.9.0.jar
112d78
 mockito-all-1.8.5.jar
136d101
 servlet-api-2.5-20081211.jar
139d103
 snappy-0.2.jar
144d107
 spark-hive_2.10-1.0.0.jar
151d113
 ST4-4.0.4.jar
153d114
 stringtemplate-3.2.1.jar
156d116
 velocity-1.7.jar
158d117
 xz-1.0.jar
{code}

Some initial investigation suggests we may need to take some precaution 
surrounding (a) jetty and (b) servlet-api.


 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 I'd like to have binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  

[jira] [Created] (SPARK-1778) Add 'limit' transformation to SchemaRDD.

2014-05-12 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-1778:


 Summary: Add 'limit' transformation to SchemaRDD.
 Key: SPARK-1778
 URL: https://issues.apache.org/jira/browse/SPARK-1778
 Project: Spark
  Issue Type: Improvement
Reporter: Takuya Ueshin


Add {{limit}} transformation to {{SchemaRDD}}.
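For context, a minimal sketch of how the proposed transformation might be used, assuming a 1.0-style SQLContext; the case class, app name, table name, and data below are made up for illustration, and {{limit}} itself is shown only as the proposed API:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative only: the case class, app name, and data are made up.
case class Person(name: String, age: Int)

object LimitSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("limit-sketch").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion

    val people = sc.parallelize(Seq(Person("a", 1), Person("b", 2), Person("c", 3)))
    val schemaRDD = createSchemaRDD(people)

    // Proposed: a limit(n) transformation that behaves like SQL's LIMIT and stays
    // in the query plan (unlike take(n), which is an action returning a local array):
    // val firstTwo = schemaRDD.limit(2)

    // Roughly the same effect today through the SQL interface:
    schemaRDD.registerAsTable("people")
    sqlContext.sql("SELECT * FROM people LIMIT 2").collect().foreach(println)

    sc.stop()
  }
}
{code}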



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1787) Build failure on JDK8 :: SBT fails to load build configuration file

2014-05-12 Thread Richard Gomes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994244#comment-13994244
 ] 

Richard Gomes commented on SPARK-1787:
--

If I switch to JDK7, keeping everything else unchanged, SBT is able to load the 
build file.


(j7s10)rgomes@terra:~/workspace/spark-0.9.1$ java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


(j7s10)rgomes@terra:~/workspace/spark-0.9.1$ scala -version
Scala code runner version 2.10.3 -- Copyright 2002-2013, LAMP/EPFL


(j7s10)rgomes@terra:~/workspace/spark-0.9.1$ sbt/sbt clean
Launching sbt from sbt/sbt-launch-0.12.4.jar
[info] Loading project definition from 
/home/rgomes/workspace/spark-0.9.1/project/project
[info] Compiling 1 Scala source to 
/home/rgomes/workspace/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes...
[info] Loading project definition from 
/home/rgomes/workspace/spark-0.9.1/project
[info] Set current project to root (in build 
file:/home/rgomes/workspace/spark-0.9.1/)
[success] Total time: 0 s, completed 10-May-2014 15:40:26


 Build failure on JDK8 :: SBT fails to load build configuration file
 ---

 Key: SPARK-1787
 URL: https://issues.apache.org/jira/browse/SPARK-1787
 Project: Spark
  Issue Type: New Feature
  Components: Build
Affects Versions: 0.9.0
 Environment: JDK8
 Scala 2.10.X
 SBT 0.12.X
Reporter: Richard Gomes
Priority: Minor

 SBT fails to build under JDK8.
 Please find steps to reproduce the error below:
 (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ uname -a
 Linux terra 3.13-1-amd64 #1 SMP Debian 3.13.10-1 (2014-04-15) x86_64 GNU/Linux
 (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ java -version
 java version "1.8.0_05"
 Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
 (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ scala -version
 Scala code runner version 2.10.3 -- Copyright 2002-2013, LAMP/EPFL
 (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ sbt/sbt clean
 Launching sbt from sbt/sbt-launch-0.12.4.jar
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=350m; 
 support was removed in 8.0
 [info] Loading project definition from 
 /home/rgomes/workspace/spark-0.9.1/project/project
 [info] Compiling 1 Scala source to 
 /home/rgomes/workspace/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes...
 [error] error while loading CharSequence, class file 
 '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/lang/CharSequence.class)' is 
 broken
 [error] (bad constant pool tag 15 at byte 1501)
 [error] error while loading Comparator, class file 
 '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/util/Comparator.class)' is 
 broken
 [error] (bad constant pool tag 15 at byte 5003)
 [error] two errors found
 [error] (compile:compile) Compilation failed
 Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1760) mvn -Dsuites=* test throw an ClassNotFoundException

2014-05-12 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992873#comment-13992873
 ] 

Guoqiang Li commented on SPARK-1760:


Hi, [~srowen]
Is there a perfect solution?
The [building-with-maven.md|https://github.com/apache/spark/blob/master/docs/building-with-maven.md] has been updated.

  mvn  -Dsuites=*  test throw an ClassNotFoundException
 --

 Key: SPARK-1760
 URL: https://issues.apache.org/jira/browse/SPARK-1760
 Project: Spark
  Issue Type: Bug
Reporter: Guoqiang Li

 {{mvn -Dhadoop.version=0.23.9 -Phadoop-0.23 -Dsuites=org.apache.spark.repl.ReplSuite test}} =>
 {code}
 *** RUN ABORTED ***
   java.lang.ClassNotFoundException: org.apache.spark.repl.ReplSuite
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1470)
   at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1469)
   at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
   at scala.collection.immutable.List.foreach(List.scala:318)
   ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1781) Generalized validity checking for configuration parameters

2014-05-12 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993893#comment-13993893
 ] 

William Benton commented on SPARK-1781:
---

Could someone assign this issue to me?

 Generalized validity checking for configuration parameters
 --

 Key: SPARK-1781
 URL: https://issues.apache.org/jira/browse/SPARK-1781
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: William Benton
Priority: Minor

 Issues like SPARK-1779 could be handled easily by a general mechanism for 
 specifying whether or not a configuration parameter value is valid or not 
 (and then excepting or warning and switching to a default value if it is 
 not).  I think it's possible to do this in a fairly lightweight fashion.
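A lightweight sketch of what such a mechanism could look like; the helper name, the warn-and-fall-back policy, and the example key below are assumptions for illustration, not the actual proposal:

{code}
import org.apache.spark.SparkConf

// Hedged sketch: look up a key, test the parsed value with a predicate, and
// either accept it or warn and fall back to the default. Names are illustrative.
object ConfValidation {
  def validatedInt(conf: SparkConf, key: String, default: Int)(valid: Int => Boolean): Int = {
    val value = conf.getInt(key, default)
    if (valid(value)) {
      value
    } else {
      System.err.println(s"Invalid value $value for $key; falling back to $default")
      default
    }
  }
}

// Example usage: a port must fall in the valid range, otherwise use the default.
// val port = ConfValidation.validatedInt(conf, "spark.ui.port", 4040)(p => p > 0 && p < 65536)
{code}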



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1803) Rename test resources to be compatible with Windows FS

2014-05-12 Thread Stevo Slavic (JIRA)
Stevo Slavic created SPARK-1803:
---

 Summary: Rename test resources to be compatible with Windows FS
 Key: SPARK-1803
 URL: https://issues.apache.org/jira/browse/SPARK-1803
 Project: Spark
  Issue Type: Task
  Components: Windows
Affects Versions: 0.9.1
Reporter: Stevo Slavic
Priority: Trivial


{{git clone}} of master branch and then {{git status}} on Windows reports 
untracked files:

{noformat}
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#   sql/hive/src/test/resources/golden/Column pruning
#   sql/hive/src/test/resources/golden/Partition pruning
#   sql/hive/src/test/resources/golden/Partiton pruning
{noformat}

The actual issue is that several files under the {{sql/hive/src/test/resources/golden}} 
directory have a colon in their name, which is an invalid character in file names on Windows.

Please have these files renamed to Windows-compatible file names.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994990#comment-13994990
 ] 

Sean Owen commented on SPARK-1802:
--

[~pwendell] You can see my start on it here:

https://github.com/srowen/spark/commits/SPARK-1802
https://github.com/srowen/spark/commit/a856604cfc67cb58146ada01fda6dbbb2515fa00

This resolves the new issues you note in your diff.


Next issue is that hive-exec, quite awfully, includes a copy of all of its 
transitive dependencies in its artifact. See 
https://issues.apache.org/jira/browse/HIVE-5733 and note the warnings you'll 
get during assembly:

{code}
[WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping 
classes: 
[WARNING]   - org.apache.thrift.transport.TSaslTransport$SaslResponse
...
{code}

hive-exec is in fact used in this module. Aside from actual surgery on the 
artifact with the shade plugin, you can't control the dependencies as a result. 
This may be simply the best that can be done right now. If it has worked, it 
has worked.


Am I right that the datanucleus JARs *are* meant to be in the assembly, only 
for the Hive build?
https://github.com/apache/spark/pull/688
https://github.com/apache/spark/pull/610

That's good if so since that's what your diff shows.


Finally, while we're here, I note that there are still a few JAR conflicts that 
turn up when you build the assembly *without* Hive. (I'm going to ignore 
conflicts in examples; these can be cleaned up but aren't really a big deal 
given its nature.)  We could touch those up too.

This is in the normal build (and I know how to zap most of this problem):
{code}
[WARNING] commons-beanutils-core-1.8.0.jar, commons-beanutils-1.7.0.jar define 
82 overlappping classes: 
{code}

These turn up in the Hadoop 2.x + YARN build:
{code}
[WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 
overlappping classes: 
...
[WARNING] jcl-over-slf4j-1.7.5.jar, commons-logging-1.1.3.jar define 6 
overlappping classes: 
...
[WARNING] activation-1.1.jar, javax.activation-1.1.0.v201105071233.jar define 
17 overlappping classes: 
...
[WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 
overlappping classes: 
{code}

These should be easy to track down. Shall I?

 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 I'd like to have binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr ":" "\n" | awk ' { FS="/"; print ( $(NF) ); }' | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  avro-mapred-1.7.4.jar
  bonecp-0.7.1.RELEASE.jar
 22d13
  commons-cli-1.2.jar
 25d15
  commons-compress-1.4.1.jar
 33,34d22
  commons-logging-1.1.1.jar
  commons-logging-api-1.0.4.jar
 38d25
  commons-pool-1.5.4.jar
 46,49d32
  datanucleus-api-jdo-3.2.1.jar
  datanucleus-core-3.2.2.jar
  datanucleus-rdbms-3.2.1.jar
  derby-10.4.2.0.jar
 53,57d35
  hive-common-0.12.0.jar
  hive-exec-0.12.0.jar
  hive-metastore-0.12.0.jar
  hive-serde-0.12.0.jar
  hive-shims-0.12.0.jar
 60,61d37
  httpclient-4.1.3.jar
  httpcore-4.1.3.jar
 68d43
  JavaEWAH-0.3.2.jar
 73d47
  javolution-5.5.1.jar
 76d49
  jdo-api-3.0.1.jar
 78d50
  jetty-6.1.26.jar
 87d58
  jetty-util-6.1.26.jar
 93d63
  json-20090211.jar
 98d67
  jta-1.1.jar
 103,104d71
  libfb303-0.9.0.jar
  libthrift-0.9.0.jar
 112d78
  mockito-all-1.8.5.jar
 136d101
  servlet-api-2.5-20081211.jar
 139d103
  snappy-0.2.jar
 144d107
  spark-hive_2.10-1.0.0.jar
 151d113
  ST4-4.0.4.jar
 153d114
  stringtemplate-3.2.1.jar
 156d116
  velocity-1.7.jar
 158d117
  xz-1.0.jar
 {code}
 Some initial investigation suggests we may need to take some precaution 
 surrounding (a) jetty and (b) servlet-api.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1770) repartition and coalesce(shuffle=true) put objects with the same key in the same bucket

2014-05-12 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993494#comment-13993494
 ] 

Sandeep Singh commented on SPARK-1770:
--

I think this is fixed in PR https://github.com/apache/spark/pull/704 by 
[~pwendell]

 repartition and coalesce(shuffle=true) put objects with the same key in the 
 same bucket
 ---

 Key: SPARK-1770
 URL: https://issues.apache.org/jira/browse/SPARK-1770
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Matei Zaharia
Priority: Blocker
  Labels: Starter
 Fix For: 1.0.0


 This is bad when you have many identical objects. We should assign each one a 
 random key.
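For illustration, a minimal sketch of the suggested direction (pair each element with a random key before the shuffle, then drop the key); this is only a sketch of the idea, not the actual fix in the pull request above:

{code}
import scala.reflect.ClassTag
import scala.util.Random

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD functions (partitionBy, values)
import org.apache.spark.rdd.RDD

// Hedged sketch: identical objects currently hash to the same bucket; a random
// key spreads them across partitions, and the key is dropped afterwards.
def randomlyRepartition[T: ClassTag](rdd: RDD[T], numPartitions: Int): RDD[T] = {
  rdd.map(x => (Random.nextInt(numPartitions), x))
     .partitionBy(new HashPartitioner(numPartitions))
     .values
}
{code}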



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1797) streaming on hdfs can detected all new file, but the sum of all the rdd.count() not equals which had detected

2014-05-12 Thread QingFeng Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

QingFeng Zhang updated SPARK-1797:
--

Description: 
When I put 200 PNG files into HDFS, Spark Streaming can detect all 200 files, but 
the sum of rdd.count() over the batches is less than 200, always between 130 and 
170. I don't know why... is this a bug?
PS: When I put the 200 files into HDFS before the streaming job runs, it gets the 
correct count and the right result.

  

  was:
when I put 200 png files to Hdfs , I found sparkStreaming counld detect 200 
files , but the sum of rdd.count() is less than 200, always  between 130 and 
170, I don't know why...Is this a Bug?
PS: When I put 200 files in hdfs before streaming run , It get the correct 
count and right result.

  def main(args: Array[String]) {

    val conf = new SparkConf().setMaster(SparkURL)
      .setAppName("QimageStreaming-broadcast")
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setJars(SparkContext.jarOfClass(this.getClass()))

    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrator", "qing.hdu.Image.MyRegistrator")
    conf.set("spark.kryoserializer.buffer.mb", "10");

    val ssc = new StreamingContext(conf, Seconds(2))

    val inputFormatClass = classOf[QimageInputFormat[Text, Qimage]]
    val outputFormatClass = classOf[QimageOutputFormat[Text, Qimage]]

    val input_path = HdfsURL + "/Qimage/input"
    val output_path = HdfsURL + "/Qimage/output/"
    val bg_path = HdfsURL + "/Qimage/bg/"

    val bg = ssc.sparkContext.newAPIHadoopFile[Text, Qimage,
      QimageInputFormat[Text, Qimage]](bg_path)
    val bbg = bg.map(data => (data._1.toString(), data._2))
    val broadcastbg = ssc.sparkContext.broadcast(bbg)
    val file = ssc.fileStream[Text, Qimage,
      QimageInputFormat[Text, Qimage]](input_path)

    val qingbg = broadcastbg.value.collectAsMap
    val foreachFunc = (rdd: RDD[(Text, Qimage)], time: Time) => {
      val rddnum = rdd.count
      System.out.println("\n\n" + "rddnum is " + rddnum + "\n\n")
      if (rddnum > 0) {
        System.out.println("here is foreachFunc")

        val a = rdd.keys
        val b = a.first
        val cbg = qingbg.get(getbgID(b)).getOrElse(new Qimage)

        rdd.map(data => (data._1, (new QimageProc(data._1, data._2)).koutu(cbg)))
          .saveAsNewAPIHadoopFile(output_path, classOf[Text], classOf[Qimage],
            outputFormatClass)
      }
    }

    file.foreachRDD(foreachFunc)
    ssc.start()
    ssc.awaitTermination()
  }


 streaming on hdfs can detected all new file, but the sum of all the 
 rdd.count() not equals which had detected
 -

 Key: SPARK-1797
 URL: https://issues.apache.org/jira/browse/SPARK-1797
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 0.9.0
 Environment: spark0.9.0,hadoop2.3.0,1 Master,5 Slaves.
Reporter: QingFeng Zhang
 Attachments: 1.png


 when I put 200 png files to Hdfs , I found sparkStreaming counld detect 200 
 files , but the sum of rdd.count() is less than 200, always  between 130 and 
 170, I don't know why...Is this a Bug?
 PS: When I put 200 files in hdfs before streaming run , It get the correct 
 count and right result.
   



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-1797) streaming on hdfs can detected all new file, but the sum of all the rdd.count() not equals which had detected

2014-05-12 Thread QingFeng Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

QingFeng Zhang reopened SPARK-1797:
---


 streaming on hdfs can detected all new file, but the sum of all the 
 rdd.count() not equals which had detected
 -

 Key: SPARK-1797
 URL: https://issues.apache.org/jira/browse/SPARK-1797
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 0.9.0
 Environment: spark0.9.0,hadoop2.3.0,1 Master,5 Slaves.
Reporter: QingFeng Zhang
 Attachments: 1.png


 when I put 200 png files to Hdfs , I found sparkStreaming counld detect 200 
 files , but the sum of rdd.count() is less than 200, always  between 130 and 
 170, I don't know why...Is this a Bug?
 PS: When I put 200 files in hdfs before streaming run , It get the correct 
 count and right result.
   def main(args: Array[String]) {
     val conf = new SparkConf().setMaster(SparkURL)
       .setAppName("QimageStreaming-broadcast")
       .setSparkHome(System.getenv("SPARK_HOME"))
       .setJars(SparkContext.jarOfClass(this.getClass()))
     conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     conf.set("spark.kryo.registrator", "qing.hdu.Image.MyRegistrator")
     conf.set("spark.kryoserializer.buffer.mb", "10");
     val ssc = new StreamingContext(conf, Seconds(2))
     val inputFormatClass = classOf[QimageInputFormat[Text, Qimage]]
     val outputFormatClass = classOf[QimageOutputFormat[Text, Qimage]]
     val input_path = HdfsURL + "/Qimage/input"
     val output_path = HdfsURL + "/Qimage/output/"
     val bg_path = HdfsURL + "/Qimage/bg/"
     val bg = ssc.sparkContext.newAPIHadoopFile[Text, Qimage,
       QimageInputFormat[Text, Qimage]](bg_path)
     val bbg = bg.map(data => (data._1.toString(), data._2))
     val broadcastbg = ssc.sparkContext.broadcast(bbg)
     val file = ssc.fileStream[Text, Qimage,
       QimageInputFormat[Text, Qimage]](input_path)
     val qingbg = broadcastbg.value.collectAsMap
     val foreachFunc = (rdd: RDD[(Text, Qimage)], time: Time) => {
       val rddnum = rdd.count
       System.out.println("\n\n" + "rddnum is " + rddnum + "\n\n")
       if (rddnum > 0) {
         System.out.println("here is foreachFunc")
         val a = rdd.keys
         val b = a.first
         val cbg = qingbg.get(getbgID(b)).getOrElse(new Qimage)
         rdd.map(data => (data._1, (new QimageProc(data._1, data._2)).koutu(cbg)))
           .saveAsNewAPIHadoopFile(output_path, classOf[Text],
             classOf[Qimage], outputFormatClass)
       }
     }
     file.foreachRDD(foreachFunc)
     ssc.start()
     ssc.awaitTermination()
   }



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError

2014-05-12 Thread sri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992920#comment-13992920
 ] 

sri commented on SPARK-1394:


We are also bumping into the same issue. May I know how and where we can comment 
out the signal binding in PySpark?

 calling system.platform on worker raises IOError
 

 Key: SPARK-1394
 URL: https://issues.apache.org/jira/browse/SPARK-1394
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0
 Environment: Tested on Ubuntu and Linux, local and remote master, 
 python 2.7.*
Reporter: Idan Zalzberg
  Labels: pyspark

 A simple program that calls system.platform() on the worker fails most of the 
 time (it works some times but very rarely).
 This is critical since many libraries call that method (e.g. boto).
 Here is the trace of the attempt to call that method:
 $ /usr/local/spark/bin/pyspark
 Python 2.7.3 (default, Feb 27 2014, 20:00:17)
 [GCC 4.6.3] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback 
 address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
 14/04/02 18:18:38 INFO Remoting: Starting remoting
 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140402181839-919f
 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 
 MB.
 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id 
 = ConnectionManagerId(10.33.102.46,43357)
 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
 block manager 10.33.102.46:43357 with 294.6 MB RAM
 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at 
 http://10.33.102.46:51803
 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at 
 http://10.33.102.46:4040
 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/__ / .__/\_,_/_/ /_/\_\   version 0.9.0
   /_/
 Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
 Spark context available as sc.
  >>> import platform
  >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at stdin:1
 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at stdin:1) with 1 
 output partitions (allowLocal=false)
 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at 
 stdin:1)
 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at 
 collect at stdin:1), which has no missing parents
 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[1] at collect at stdin:1)
 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 
 12 ms
 14/04/02 18:19:17 INFO Executor: Running task ID 0
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File /usr/local/spark/python/pyspark/worker.py, line 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /usr/local/spark/python/pyspark/serializers.py, line 182, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /usr/local/spark/python/pyspark/serializers.py, line 117, in 
 dump_stream
 for obj in iterator:
   File 

[jira] [Updated] (SPARK-1795) Add recursive directory file search to fileInputStream

2014-05-12 Thread Rick OToole (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick OToole updated SPARK-1795:
---

Description: 
When writing logs, they are often partitioned into a hierarchical directory 
structure. This change will allow Spark Streaming to monitor all 
sub-directories of a parent directory to find new files as they are added. 

See https://github.com/apache/spark/pull/537

  was:When writing logs, they are often partitioned into a hierarchical 
directory structure. This change will allow spark streaming to monitor all 
sub-directories of a parent directory to find new files as they are added. 

   Priority: Major  (was: Minor)

 Add recursive directory file search to fileInputStream
 --

 Key: SPARK-1795
 URL: https://issues.apache.org/jira/browse/SPARK-1795
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Rick OToole

 When writing logs, they are often partitioned into a hierarchical directory 
 structure. This change will allow spark streaming to monitor all 
 sub-directories of a parent directory to find new files as they are added. 
 See https://github.com/apache/spark/pull/537



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1805) Error launching cluster when master and slave machines are of different virtualization types

2014-05-12 Thread Han JU (JIRA)
Han JU created SPARK-1805:
-

 Summary: Error launching cluster when master and slave machines 
are of different virtualization types
 Key: SPARK-1805
 URL: https://issues.apache.org/jira/browse/SPARK-1805
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Han JU
Priority: Minor


In the current EC2 script, the AMI image object is loaded only once. This is OK 
when the master and slave machines are of the same virtualization type (pvm or hvm), 
but it won't work if, say, the master is pvm and the slaves are hvm, since the AMI 
is not compatible between these two kinds of virtualization.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1803) Rename test resources to be compatible with Windows FS

2014-05-12 Thread Stevo Slavic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995152#comment-13995152
 ] 

Stevo Slavic commented on SPARK-1803:
-

Created pull request with fix for this issue (see 
[here|https://github.com/apache/spark/pull/739]).

 Rename test resources to be compatible with Windows FS
 --

 Key: SPARK-1803
 URL: https://issues.apache.org/jira/browse/SPARK-1803
 Project: Spark
  Issue Type: Task
  Components: Windows
Affects Versions: 0.9.1
Reporter: Stevo Slavic
Priority: Trivial

 {{git clone}} of master branch and then {{git status}} on Windows reports 
 untracked files:
 {noformat}
 # Untracked files:
 #   (use "git add <file>..." to include in what will be committed)
 #
 #   sql/hive/src/test/resources/golden/Column pruning
 #   sql/hive/src/test/resources/golden/Partition pruning
 #   sql/hive/src/test/resources/golden/Partiton pruning
 {noformat}
 Actual issue is that several files under 
 {{sql/hive/src/test/resources/golden}} directory have colon in name which is 
 invalid character in file name on Windows.
 Please have these files renamed to a Windows compatible file name.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1763) SparkSubmit arguments do not propagate to python files on YARN

2014-05-12 Thread Andrew Or (JIRA)
Andrew Or created SPARK-1763:


 Summary: SparkSubmit arguments do not propagate to python files on 
YARN
 Key: SPARK-1763
 URL: https://issues.apache.org/jira/browse/SPARK-1763
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, YARN
Affects Versions: 0.9.1
Reporter: Andrew Or
Priority: Blocker
 Fix For: 1.0.0


The python SparkConf load defaults does not pick up system properties set by 
SparkSubmit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1755) Spark-submit --name does not resolve to application name on YARN

2014-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-1755:
-

Fix Version/s: (was: 1.0.1)
   1.0.0

 Spark-submit --name does not resolve to application name on YARN
 

 Key: SPARK-1755
 URL: https://issues.apache.org/jira/browse/SPARK-1755
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.1
Reporter: Andrew Or
 Fix For: 1.0.0


 In YARN client mode, --name is ignored because the deploy mode is client, and 
 the name is for some reason a [cluster 
 config|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L170)].
 In YARN cluster mode, --name is passed to the 
 org.apache.spark.deploy.yarn.Client as a command line argument. The Client 
 class, however, uses this name only as the [app name for the 
 RM|https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L80],
  but not for Spark. In other words, when SparkConf attempts to load default 
 configs, application name is not set.
 In both cases, passing --name to SparkSubmit does not actually cause Spark to 
 adopt it as its application name, despite what the usage promises.
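Until --name is honoured, a workaround is to set the name explicitly in code; a minimal sketch, where the application name is illustrative:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch: set the application name directly on SparkConf so Spark
// picks it up regardless of how spark-submit passes (or drops) --name.
val conf = new SparkConf().setAppName("my-yarn-app")  // "my-yarn-app" is illustrative
val sc = new SparkContext(conf)
{code}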



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1764) EOF reached before Python server acknowledged

2014-05-12 Thread Bouke van der Bijl (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bouke van der Bijl updated SPARK-1764:
--

Priority: Blocker  (was: Critical)

 EOF reached before Python server acknowledged
 -

 Key: SPARK-1764
 URL: https://issues.apache.org/jira/browse/SPARK-1764
 Project: Spark
  Issue Type: Bug
  Components: Mesos, PySpark
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
Priority: Blocker
  Labels: mesos, pyspark

 I'm getting EOF reached before Python server acknowledged while using 
 PySpark on Mesos. The error manifests itself in multiple ways. One is:
 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
 failed due to the error EOF reached before Python server acknowledged; 
 shutting down SparkContext
 And the other has a full stacktrace:
 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
 acknowledged
 org.apache.spark.SparkException: EOF reached before Python server acknowledged
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
   at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This error causes the SparkContext to shutdown. I have not been able to 
 reliably reproduce this bug, it seems to happen randomly, but if you run 
 enough tasks on a SparkContext it'll hapen eventually



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1764) EOF reached before Python server acknowledged

2014-05-12 Thread Bouke van der Bijl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992994#comment-13992994
 ] 

Bouke van der Bijl edited comment on SPARK-1764 at 5/8/14 6:31 PM:
---

I can semi-reliably recreate this by just running this code:

{{quote}}
while True:
  sc.parallelize(range(100)).map(lambda n: n * 2).collect()
{{quote}}

Running this on Mesos will eventually crash with 

Py4JJavaError: An error occurred while calling o1142.collect.
: org.apache.spark.SparkException: Job 101 cancelled as part of cancellation of 
all jobs
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at 
org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:499)
at 
org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1151)
at 
org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1147)
at akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295)
at 
akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253)
at akka.actor.ActorCell.handleFailure(ActorCell.scala:338)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
at akka.dispatch.Mailbox.run(Mailbox.scala:218)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


I0508 18:29:03.623627  7868 sched.cpp:730] Stopping framework 
'20140508-173240-16842879-5050-24645-0032'
14/05/08 18:29:04 ERROR OneForOneStrategy: EOF reached before Python server 
acknowledged
org.apache.spark.SparkException: EOF reached before Python server acknowledged
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
at 
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
at 
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


was (Author: bouk):
I can semi-reliably recreate this by just running 

[jira] [Created] (SPARK-1806) Upgrade to Mesos 0.18.1 with Shaded Protobuf

2014-05-12 Thread Bernardo Gomez Palacio (JIRA)
Bernardo Gomez Palacio created SPARK-1806:
-

 Summary: Upgrade to Mesos 0.18.1 with Shaded Protobuf
 Key: SPARK-1806
 URL: https://issues.apache.org/jira/browse/SPARK-1806
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0, 1.0.1
Reporter: Bernardo Gomez Palacio


Upgrade Spark to depend on Mesos 0.18.1 with shaded protobuf. This version of 
Mesos does not externalize its dependency on the protobuf version (now shaded 
through the namespace org.apache.mesos.protobuf) and therefore facilitates 
integration with systems that do depend on specific versions of protobufs such 
as Hadoop 1.0.x, 2.x, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1772) Spark executors do not successfully die on OOM

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1772.


   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 715
[https://github.com/apache/spark/pull/715]

 Spark executors do not successfully die on OOM
 --

 Key: SPARK-1772
 URL: https://issues.apache.org/jira/browse/SPARK-1772
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Aaron Davidson
 Fix For: 1.0.0


 Executor catches Throwable, and does not always die when JVM fatal exceptions 
 occur. This is a problem because any subsequent use of these Executors are 
 very likely to fail.
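A minimal sketch of the kind of guard being described, assuming the goal is to let fatal JVM errors (such as OutOfMemoryError) terminate the executor instead of being swallowed; the helper name is illustrative, not Spark's actual executor code:

{code}
import scala.util.control.NonFatal

// Hedged sketch: handle only non-fatal exceptions locally and let fatal JVM
// errors take the process down rather than leaving a broken executor running.
def runTaskSafely(task: () => Unit): Unit = {
  try {
    task()
  } catch {
    case NonFatal(e) =>
      System.err.println(s"Task failed: $e")  // report, keep the executor alive
    case fatal: Throwable =>
      System.err.println(s"Fatal error, exiting: $fatal")
      System.exit(1)  // do not keep a broken JVM around
  }
}
{code}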



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1806) Upgrade to Mesos 0.18.1 with Shaded Protobuf

2014-05-12 Thread Bernardo Gomez Palacio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995265#comment-13995265
 ] 

Bernardo Gomez Palacio commented on SPARK-1806:
---

Should close SPARK-1433

 Upgrade to Mesos 0.18.1 with Shaded Protobuf
 

 Key: SPARK-1806
 URL: https://issues.apache.org/jira/browse/SPARK-1806
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0, 1.0.1
Reporter: Bernardo Gomez Palacio
  Labels: mesos

 Upgrade Spark to depend on Mesos 0.18.1 with shaded protobuf. This version of 
 Mesos does not externalize its dependency on the protobuf version (now shaded 
 through the namespace org.apache.mesos.protobuf) and therefore facilitates 
 integration with systems that do depend on specific versions of protobufs 
 such as Hadoop 1.0.x, 2.x, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1749) DAGScheduler supervisor strategy broken with Mesos

2014-05-12 Thread Bouke van der Bijl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993602#comment-13993602
 ] 

Bouke van der Bijl commented on SPARK-1749:
---

This isn't really PySpark-specific; this works fine on other backends, which will 
mark the task as failed and just keep the SparkContext running.

It shouldn't be shutting down the whole SparkContext just because a single job 
failed.

 DAGScheduler supervisor strategy broken with Mesos
 --

 Key: SPARK-1749
 URL: https://issues.apache.org/jira/browse/SPARK-1749
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Spark Core
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
Assignee: Mark Hamstra
Priority: Blocker
  Labels: mesos, scheduler, scheduling

 Any bad Python code will trigger this bug, for example 
 `sc.parallelize(range(100)).map(lambda n: undefined_variable * 2).collect()` 
 will cause a `undefined_variable isn't defined`, which will cause spark to 
 try to kill the task, resulting in the following stacktrace:
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:184)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:182)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:182)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:175)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:175)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1058)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:499)
   at 
 org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1151)
   at 
 org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1147)
   at akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295)
   at 
 akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253)
   at akka.actor.ActorCell.handleFailure(ActorCell.scala:338)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
   at akka.dispatch.Mailbox.run(Mailbox.scala:218)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This is because killTask isn't implemented for the MesosSchedulerBackend. I 
 assume this isn't pyspark-specific, as there will be other 

[jira] [Commented] (SPARK-1806) Upgrade to Mesos 0.18.1 with Shaded Protobuf

2014-05-12 Thread Bernardo Gomez Palacio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995382#comment-13995382
 ] 

Bernardo Gomez Palacio commented on SPARK-1806:
---

Thanks [~pwendell] for addressing this so quickly!

 Upgrade to Mesos 0.18.1 with Shaded Protobuf
 

 Key: SPARK-1806
 URL: https://issues.apache.org/jira/browse/SPARK-1806
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0, 1.0.1
Reporter: Bernardo Gomez Palacio
  Labels: mesos
 Fix For: 1.0.0


 Upgrade Spark to depend on Mesos 0.18.1 with shaded protobuf. This version of 
 Mesos does not externalize its dependency on the protobuf version (now shaded 
 through the namespace org.apache.mesos.protobuf) and therefore facilitates 
 integration with systems that do depend on specific versions of protobufs 
 such as Hadoop 1.0.x, 2.x, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1806) Upgrade to Mesos 0.18.1 with Shaded Protobuf

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1806.


   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 741
[https://github.com/apache/spark/pull/741]

 Upgrade to Mesos 0.18.1 with Shaded Protobuf
 

 Key: SPARK-1806
 URL: https://issues.apache.org/jira/browse/SPARK-1806
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0, 1.0.1
Reporter: Bernardo Gomez Palacio
  Labels: mesos
 Fix For: 1.0.0


 Upgrade Spark to depend on Mesos 0.18.1 with shaded protobuf. This version of 
 Mesos does not externalize its dependency on the protobuf version (now shaded 
 through the namespace org.apache.mesos.protobuf) and therefore facilitates 
 integration with systems that do depend on specific versions of protobufs 
 such as Hadoop 1.0.x, 2.x, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged

2014-05-12 Thread Bouke van der Bijl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995421#comment-13995421
 ] 

Bouke van der Bijl commented on SPARK-1764:
---

I did some more digging into this and I have no idea what the exact issue is. 
The write to the Python server succeeds (which I checked from the Python side), 
but the Scala side doesn't seem to be able to read the acknowledgement. 

I have also confirmed that it isn't an issue with the Python broadcast server 
dying, as commenting out the exception makes it work fine (!) 

 EOF reached before Python server acknowledged
 -

 Key: SPARK-1764
 URL: https://issues.apache.org/jira/browse/SPARK-1764
 Project: Spark
  Issue Type: Bug
  Components: Mesos, PySpark
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
Priority: Blocker
  Labels: mesos, pyspark

 I'm getting "EOF reached before Python server acknowledged" while using 
 PySpark on Mesos. The error manifests itself in multiple ways. One is:
 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
 failed due to the error EOF reached before Python server acknowledged; 
 shutting down SparkContext
 And the other has a full stacktrace:
 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
 acknowledged
 org.apache.spark.SparkException: EOF reached before Python server acknowledged
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
   at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This error causes the SparkContext to shut down. I have not been able to 
 reliably reproduce this bug; it seems to happen randomly, but if you run 
 enough tasks on a SparkContext it'll happen eventually.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1802:
-

Attachment: hive-exec-jar-problems.txt

 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0

 Attachments: hive-exec-jar-problems.txt


 I'd like to have binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
 assembly | grep -v INFO | tr : \n |  awk ' { FS=/; print ( $(NF) ); }' 
 | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
 assembly | grep -v INFO | tr : \n |  awk ' { FS=/; print ( $(NF) ); }' 
 | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  avro-mapred-1.7.4.jar
  bonecp-0.7.1.RELEASE.jar
 22d13
  commons-cli-1.2.jar
 25d15
  commons-compress-1.4.1.jar
 33,34d22
  commons-logging-1.1.1.jar
  commons-logging-api-1.0.4.jar
 38d25
  commons-pool-1.5.4.jar
 46,49d32
  datanucleus-api-jdo-3.2.1.jar
  datanucleus-core-3.2.2.jar
  datanucleus-rdbms-3.2.1.jar
  derby-10.4.2.0.jar
 53,57d35
  hive-common-0.12.0.jar
  hive-exec-0.12.0.jar
  hive-metastore-0.12.0.jar
  hive-serde-0.12.0.jar
  hive-shims-0.12.0.jar
 60,61d37
  httpclient-4.1.3.jar
  httpcore-4.1.3.jar
 68d43
  JavaEWAH-0.3.2.jar
 73d47
  javolution-5.5.1.jar
 76d49
  jdo-api-3.0.1.jar
 78d50
  jetty-6.1.26.jar
 87d58
  jetty-util-6.1.26.jar
 93d63
  json-20090211.jar
 98d67
  jta-1.1.jar
 103,104d71
  libfb303-0.9.0.jar
  libthrift-0.9.0.jar
 112d78
  mockito-all-1.8.5.jar
 136d101
  servlet-api-2.5-20081211.jar
 139d103
  snappy-0.2.jar
 144d107
  spark-hive_2.10-1.0.0.jar
 151d113
  ST4-4.0.4.jar
 153d114
  stringtemplate-3.2.1.jar
 156d116
  velocity-1.7.jar
 158d117
  xz-1.0.jar
 {code}
 Some initial investigation suggests we may need to take some precaution 
 surrounding (a) jetty and (b) servlet-api.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1801) Open up some private APIs related to creating new RDDs for developers

2014-05-12 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1801:


Summary: Open up some private APIs related to creating new RDDs for 
developers  (was: Open up sime private APIs related to creating new RDDs for 
developers)

 Open up some private APIs related to creating new RDDs for developers
 -

 Key: SPARK-1801
 URL: https://issues.apache.org/jira/browse/SPARK-1801
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: koert kuipers
Priority: Minor

 In writing my own RDD I ran into a few issues with respect to stuff being 
 private in Spark.
 In compute() I would like to return an iterator that respects task killing (as 
 HadoopRDD does), but the mechanics for that are inside the private 
 InterruptibleIterator. Also, the exception I am supposed to throw 
 (TaskKilledException) is private to Spark.
 See also:
 http://apache-spark-user-list.1001560.n3.nabble.com/Re-writing-my-own-RDD-td5558.html
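 As an illustration, a minimal sketch (class name and sizes are hypothetical) of 
 the compute() a custom RDD author would like to write if InterruptibleIterator 
 and TaskKilledException were usable from user code, which is what this issue 
 asks for:
 {code}
 import org.apache.spark.{InterruptibleIterator, Partition, SparkContext, TaskContext}
 import org.apache.spark.rdd.RDD

 // Hypothetical single-partition RDD whose compute() respects task killing the
 // way HadoopRDD does, by wrapping its iterator in InterruptibleIterator.
 class RangeishRDD(sc: SparkContext, n: Int) extends RDD[Int](sc, Nil) {

   override def getPartitions: Array[Partition] =
     Array(new Partition { override def index: Int = 0 })

   override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
     val underlying = (0 until n).iterator
     // Iteration stops early if the task is killed, instead of running to completion.
     new InterruptibleIterator(context, underlying)
   }
 }
 {code}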



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1736) spark-submit on Windows

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1736.


Resolution: Fixed

 spark-submit on Windows
 ---

 Key: SPARK-1736
 URL: https://issues.apache.org/jira/browse/SPARK-1736
 Project: Spark
  Issue Type: Improvement
  Components: Windows
Reporter: Matei Zaharia
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.0


 - spark-submit needs a Windows version (shouldn't be too hard, it's just 
 launching a Java process)
 - spark-shell.cmd needs to run through spark-submit like it does on Unix



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1652) Fixes and improvements for spark-submit/configs

2014-05-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995892#comment-13995892
 ] 

Patrick Wendell commented on SPARK-1652:


The remaining issues here all have workarounds in 1.0, so I'm bumping this to 
1.1.

 Fixes and improvements for spark-submit/configs
 ---

 Key: SPARK-1652
 URL: https://issues.apache.org/jira/browse/SPARK-1652
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.1.0


 These are almost all a result of my config patch. Unfortunately the changes 
 were difficult to unit-test, and there were several edge cases reported.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1802:
---

Assignee: Sean Owen

 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0


 I'd like to have binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
 assembly | grep -v INFO | tr : \n |  awk ' { FS=/; print ( $(NF) ); }' 
 | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
 assembly | grep -v INFO | tr : \n |  awk ' { FS=/; print ( $(NF) ); }' 
 | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  avro-mapred-1.7.4.jar
  bonecp-0.7.1.RELEASE.jar
 22d13
  commons-cli-1.2.jar
 25d15
  commons-compress-1.4.1.jar
 33,34d22
  commons-logging-1.1.1.jar
  commons-logging-api-1.0.4.jar
 38d25
  commons-pool-1.5.4.jar
 46,49d32
  datanucleus-api-jdo-3.2.1.jar
  datanucleus-core-3.2.2.jar
  datanucleus-rdbms-3.2.1.jar
  derby-10.4.2.0.jar
 53,57d35
  hive-common-0.12.0.jar
  hive-exec-0.12.0.jar
  hive-metastore-0.12.0.jar
  hive-serde-0.12.0.jar
  hive-shims-0.12.0.jar
 60,61d37
  httpclient-4.1.3.jar
  httpcore-4.1.3.jar
 68d43
  JavaEWAH-0.3.2.jar
 73d47
  javolution-5.5.1.jar
 76d49
  jdo-api-3.0.1.jar
 78d50
  jetty-6.1.26.jar
 87d58
  jetty-util-6.1.26.jar
 93d63
  json-20090211.jar
 98d67
  jta-1.1.jar
 103,104d71
  libfb303-0.9.0.jar
  libthrift-0.9.0.jar
 112d78
  mockito-all-1.8.5.jar
 136d101
  servlet-api-2.5-20081211.jar
 139d103
  snappy-0.2.jar
 144d107
  spark-hive_2.10-1.0.0.jar
 151d113
  ST4-4.0.4.jar
 153d114
  stringtemplate-3.2.1.jar
 156d116
  velocity-1.7.jar
 158d117
  xz-1.0.jar
 {code}
 Some initial investigation suggests we may need to take some precaution 
 surrounding (a) jetty and (b) servlet-api.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1815) SparkContext's constructor that only takes a SparkConf shouldn't be a DeveloperApi

2014-05-12 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-1815:
-

 Summary: SparkContext's constructor that only takes a SparkConf 
shouldn't be a DeveloperApi
 Key: SPARK-1815
 URL: https://issues.apache.org/jira/browse/SPARK-1815
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza
 Fix For: 1.0.0


It's the constructor used in the examples.
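For reference, the constructor in question, in a minimal usage sketch (the app 
name and master URL are placeholders):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// new SparkContext(conf) is the form used throughout the bundled examples.
val conf = new SparkConf().setAppName("example").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.stop()
{code}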



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1811) Support resizable output buffer for kryo serializer

2014-05-12 Thread koert kuipers (JIRA)
koert kuipers created SPARK-1811:


 Summary: Support resizable output buffer for kryo serializer
 Key: SPARK-1811
 URL: https://issues.apache.org/jira/browse/SPARK-1811
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: koert kuipers
Priority: Minor


Currently the size of kryo serializer output buffer can be set with 
spark.kryoserializer.buffer.mb

The issue with this setting is that it has to be one-size-fits-all, so it ends 
up being the maximum size needed, even if only a single task out of many needs 
it to be that big. A resizable buffer would allow most tasks to use a 
modest-sized buffer, while the occasional task that needs a really big one can 
get it at a cost (allocating a new buffer and copying the contents over 
repeatedly as the buffer grows; with each new allocation the size doubles).

The class used for the buffer is Kryo's Output, which supports resizing if 
maxCapacity is set bigger than capacity. I suggest we provide a setting 
spark.kryoserializer.buffer.max.mb, which defaults to 
spark.kryoserializer.buffer.mb and sets Output's maxCapacity.
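For illustration, a minimal sketch of that behaviour using Kryo's Output 
directly; the sizes below are arbitrary stand-ins for the existing 
spark.kryoserializer.buffer.mb and the proposed 
spark.kryoserializer.buffer.max.mb settings:
{code}
import com.esotericsoftware.kryo.io.Output

val bufferMb = 2      // stand-in for spark.kryoserializer.buffer.mb
val maxBufferMb = 64  // stand-in for the proposed spark.kryoserializer.buffer.max.mb

// The second constructor argument is maxCapacity: Output grows its backing
// array on demand up to this limit, instead of failing once the initial
// capacity is exceeded.
val output = new Output(bufferMb * 1024 * 1024, maxBufferMb * 1024 * 1024)
{code}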

Pull request for this jira:
https://github.com/apache/spark/pull/735






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1680:
---

Description: We should make this consistent between YARN and Standalone. 
Basically, YARN mode should just use the executorEnvs from the Spark conf and 
not need SPARK_YARN_USER_ENV.  (was: We should make this consistent between 
YARN and SparkConf. Basically, YARN mode should just use the executorEnvs from 
the Spark conf and not need SPARK_YARN_USER_ENV.)

 Clean up use of setExecutorEnvs in SparkConf 
 -

 Key: SPARK-1680
 URL: https://issues.apache.org/jira/browse/SPARK-1680
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 We should make this consistent between YARN and Standalone. Basically, YARN 
 mode should just use the executorEnvs from the Spark conf and not need 
 SPARK_YARN_USER_ENV.
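 For illustration, a sketch of the SparkConf route the description prefers over 
 SPARK_YARN_USER_ENV; the variable name and value are placeholders:
 {code}
 import org.apache.spark.SparkConf

 // setExecutorEnv stores the value under spark.executorEnv.<VAR>, so the same
 // mechanism can serve YARN and Standalone alike.
 val conf = new SparkConf()
   .setAppName("env-demo")
   .setExecutorEnv("EXAMPLE_VAR", "example-value")
 {code}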



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1798) Tests should clean up temp files

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1798.


   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 732
[https://github.com/apache/spark/pull/732]

 Tests should clean up temp files
 

 Key: SPARK-1798
 URL: https://issues.apache.org/jira/browse/SPARK-1798
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Sean Owen
Priority: Minor
 Fix For: 1.0.0


 Three issues related to temp files that tests generate -- these should be 
 touched up for hygiene but are not urgent.
 1. Modules have a log4j.properties which directs the unit-test.log output file 
 to a directory like [module]/target/unit-test.log. But this ends up creating 
 [module]/[module]/target/unit-test.log instead of the former.
 2. The work/ directory is not deleted by mvn clean, either in the parent or in 
 the modules. Neither is the checkpoint/ directory created under the various 
 external modules.
 3. Many tests create a temp directory, which is not usually deleted. This can 
 be largely resolved by calling deleteOnExit() at creation and trying to call 
 Utils.deleteRecursively consistently to clean up, sometimes in an @After 
 method (see the sketch after this list).
 (If anyone seconds the motion, I can create a more significant change that 
 introduces a new test trait along the lines of LocalSparkContext, which 
 provides management of temp directories for subclasses to take advantage of.)
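 A sketch of the deleteOnExit()/@After pattern described in point 3, using a 
 self-contained recursive delete in place of Utils.deleteRecursively; the suite 
 and file names are illustrative:
 {code}
 import java.io.File
 import java.nio.file.Files
 import org.scalatest.{BeforeAndAfter, FunSuite}

 class TempDirCleanupSuite extends FunSuite with BeforeAndAfter {
   private var tempDir: File = _

   before {
     tempDir = Files.createTempDirectory("spark-test").toFile
     tempDir.deleteOnExit()  // cleaned up even if the after block never runs
   }

   after {
     deleteRecursively(tempDir)
   }

   private def deleteRecursively(f: File): Unit = {
     Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
     f.delete()
   }

   test("temp files do not outlive the test") {
     val scratch = new File(tempDir, "scratch.txt")
     Files.write(scratch.toPath, "hello".getBytes("UTF-8"))
     assert(scratch.exists())
   }
 }
 {code}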



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1773) Standalone cluster docs should be updated to reflect Spark Submit

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1773.


Resolution: Fixed

 Standalone cluster docs should be updated to reflect Spark Submit
 -

 Key: SPARK-1773
 URL: https://issues.apache.org/jira/browse/SPARK-1773
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1771) CoarseGrainedSchedulerBackend is not resilient to Akka restarts

2014-05-12 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995962#comment-13995962
 ] 

Nan Zhu commented on SPARK-1771:


[#Aaron Davidson], I think there are basically two ways to fix this bug, 
depending on whether we want to allow restarting of the driver:

1. If we allow restarting, we may need something similar to the 
persistentEngine in the deploy package.

2. If not, we can introduce a supervisor actor to stop the DriverActor and kill 
the executors, similar to what we just did in the DAGScheduler.
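A rough sketch of option 2, purely illustrative (the actor and class names are 
made up, not Spark's code): a supervisor whose strategy stops the failing child 
rather than letting Akka restart it with stale state.
{code}
import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.Stop

class DriverSupervisor(childProps: Props) extends Actor {
  // Stop instead of the default Restart, so a half-initialized replacement
  // actor never comes back up having silently lost its executor state.
  override val supervisorStrategy: SupervisorStrategy = OneForOneStrategy() {
    case _: Exception => Stop
  }

  private val child: ActorRef = context.actorOf(childProps, "driver-actor")

  def receive: Receive = { case msg => child.forward(msg) }
}
{code}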

 CoarseGrainedSchedulerBackend is not resilient to Akka restarts
 ---

 Key: SPARK-1771
 URL: https://issues.apache.org/jira/browse/SPARK-1771
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Aaron Davidson

 The exception reported in SPARK-1769 was propagated through the 
 CoarseGrainedSchedulerBackend, and caused an Actor restart of the 
 DriverActor. Unfortunately, this actor does not seem to have been written 
 with Akka restartability in mind. For instance, the new DriverActor has lost 
 all state about the prior Executors without cleanly disconnecting them. This 
 means that the driver actually has executors attached to it, but doesn't 
 think it does, which leads to mayhem of various sorts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1652) Fixes and improvements for spark-submit/configs

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1652:
---

Fix Version/s: (was: 1.0.0)
   1.1.0

 Fixes and improvements for spark-submit/configs
 ---

 Key: SPARK-1652
 URL: https://issues.apache.org/jira/browse/SPARK-1652
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.1.0


 These are almost all a result of my config patch. Unfortunately the changes 
 were difficult to unit-test, and there were several edge cases reported.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1779) Warning when spark.storage.memoryFraction is not between 0 and 1

2014-05-12 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993615#comment-13993615
 ] 

Guoqiang Li commented on SPARK-1779:


I think an exception should be thrown here.
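For example, a minimal sketch of that kind of check (illustrative only, not 
Spark's code):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
// Fail fast instead of silently accepting a nonsensical value.
require(memoryFraction >= 0.0 && memoryFraction <= 1.0,
  s"spark.storage.memoryFraction must be between 0 and 1 (got $memoryFraction)")
{code}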

 Warning when spark.storage.memoryFraction is not between 0 and 1
 

 Key: SPARK-1779
 URL: https://issues.apache.org/jira/browse/SPARK-1779
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: wangfei
 Fix For: 1.1.0


 There should be a warning when memoryFraction is lower than 0 or greater than 
 1



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2014-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-1802:



Let's keep this open given the ongoing discussion.

 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0

 Attachments: hive-exec-jar-problems.txt


 I'd like to have binary release for 1.0 include Hive support. Since this 
 isn't enabled by default in the build I don't think it's as well tested, so 
 we should dig around a bit and decide if we need to e.g. add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
 assembly | grep -v INFO | tr : \n |  awk ' { FS=/; print ( $(NF) ); }' 
 | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
 assembly | grep -v INFO | tr : \n |  awk ' { FS=/; print ( $(NF) ); }' 
 | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  avro-mapred-1.7.4.jar
  bonecp-0.7.1.RELEASE.jar
 22d13
  commons-cli-1.2.jar
 25d15
  commons-compress-1.4.1.jar
 33,34d22
  commons-logging-1.1.1.jar
  commons-logging-api-1.0.4.jar
 38d25
  commons-pool-1.5.4.jar
 46,49d32
  datanucleus-api-jdo-3.2.1.jar
  datanucleus-core-3.2.2.jar
  datanucleus-rdbms-3.2.1.jar
  derby-10.4.2.0.jar
 53,57d35
  hive-common-0.12.0.jar
  hive-exec-0.12.0.jar
  hive-metastore-0.12.0.jar
  hive-serde-0.12.0.jar
  hive-shims-0.12.0.jar
 60,61d37
  httpclient-4.1.3.jar
  httpcore-4.1.3.jar
 68d43
  JavaEWAH-0.3.2.jar
 73d47
  javolution-5.5.1.jar
 76d49
  jdo-api-3.0.1.jar
 78d50
  jetty-6.1.26.jar
 87d58
  jetty-util-6.1.26.jar
 93d63
  json-20090211.jar
 98d67
  jta-1.1.jar
 103,104d71
  libfb303-0.9.0.jar
  libthrift-0.9.0.jar
 112d78
  mockito-all-1.8.5.jar
 136d101
  servlet-api-2.5-20081211.jar
 139d103
  snappy-0.2.jar
 144d107
  spark-hive_2.10-1.0.0.jar
 151d113
  ST4-4.0.4.jar
 153d114
  stringtemplate-3.2.1.jar
 156d116
  velocity-1.7.jar
 158d117
  xz-1.0.jar
 {code}
 Some initial investigation suggests we may need to take some precaution 
 surrounding (a) jetty and (b) servlet-api.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1757) Support saving null primitives with .saveAsParquetFile()

2014-05-12 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash resolved SPARK-1757.
---

   Resolution: Fixed
Fix Version/s: 1.0.0

https://github.com/apache/spark/pull/690

 Support saving null primitives with .saveAsParquetFile()
 

 Key: SPARK-1757
 URL: https://issues.apache.org/jira/browse/SPARK-1757
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Andrew Ash
 Fix For: 1.0.0


 See stack trace below:
 {noformat}
 14/05/07 21:45:51 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch MultiInstanceRelations
 14/05/07 21:45:51 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch CaseInsensitiveAttributeReferences
 14/05/07 21:45:51 INFO optimizer.Optimizer$: Max iterations (2) reached for 
 batch ConstantFolding
 14/05/07 21:45:51 INFO optimizer.Optimizer$: Max iterations (2) reached for 
 batch Filter Pushdown
 java.lang.RuntimeException: Unsupported datatype StructType(List())
 at scala.sys.package$.error(package.scala:27)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetRelation.scala:201)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetRelation.scala:234)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetRelation.scala:267)
 at 
 org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:143)
 at 
 org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:122)
 at 
 org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:139)
 at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:264)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:264)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:265)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:265)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:268)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:268)
 at 
 org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:66)
 at 
 org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:96)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1814) Splash page should include correct syntax for launching examples

2014-05-12 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1814:
--

 Summary: Splash page should include correct syntax for launching 
examples
 Key: SPARK-1814
 URL: https://issues.apache.org/jira/browse/SPARK-1814
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged

2014-05-12 Thread Bouke van der Bijl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995632#comment-13995632
 ] 

Bouke van der Bijl commented on SPARK-1764:
---

Interesting, including SPARK-1806 in our build made it stop failing... I guess 
this can be considered fixed, then.

 EOF reached before Python server acknowledged
 -

 Key: SPARK-1764
 URL: https://issues.apache.org/jira/browse/SPARK-1764
 Project: Spark
  Issue Type: Bug
  Components: Mesos, PySpark
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
Priority: Blocker
  Labels: mesos, pyspark

 I'm getting "EOF reached before Python server acknowledged" while using 
 PySpark on Mesos. The error manifests itself in multiple ways. One is:
 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
 failed due to the error EOF reached before Python server acknowledged; 
 shutting down SparkContext
 And the other has a full stacktrace:
 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
 acknowledged
 org.apache.spark.SparkException: EOF reached before Python server acknowledged
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
   at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This error causes the SparkContext to shut down. I have not been able to 
 reliably reproduce this bug; it seems to happen randomly, but if you run 
 enough tasks on a SparkContext it'll happen eventually.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1765) Modify a typo in monitoring.md

2014-05-12 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996012#comment-13996012
 ] 

Andrew Ash commented on SPARK-1765:
---

https://github.com/apache/spark/pull/698

This can now be closed

 Modify a typo in monitoring.md
 --

 Key: SPARK-1765
 URL: https://issues.apache.org/jira/browse/SPARK-1765
 Project: Spark
  Issue Type: Bug
Reporter: Kousuke Saruta
Priority: Minor

 There is a word 'JXM' in monitoring.md.
 I guess it's a typo for 'JMX'.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1816) LiveListenerBus dies if a listener throws an exception

2014-05-12 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-1816:
-

 Summary: LiveListenerBus dies if a listener throws an exception
 Key: SPARK-1816
 URL: https://issues.apache.org/jira/browse/SPARK-1816
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Andrew Or
Priority: Critical


The exception isn't even printed.
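A simplified sketch of the hardening being asked for (the bus and listener 
types here are stand-ins, not Spark's actual classes): catch and log a 
misbehaving listener's exception so the dispatch loop survives and the failure 
is at least visible.
{code}
import org.apache.log4j.Logger

trait Listener { def onEvent(event: String): Unit }

class ListenerBus {
  private val log = Logger.getLogger(getClass)
  private var listeners = List.empty[Listener]

  def addListener(l: Listener): Unit = { listeners = l :: listeners }

  def post(event: String): Unit = listeners.foreach { l =>
    try l.onEvent(event)
    catch {
      // Log and keep going rather than letting one bad listener kill the bus.
      case e: Exception =>
        log.error(s"Listener ${l.getClass.getName} threw an exception", e)
    }
  }
}
{code}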



--
This message was sent by Atlassian JIRA
(v6.2#6252)