[jira] [Created] (SPARK-1387) Update build plugins, avoid plugin version warning, centralize versions
Sean Owen created SPARK-1387: Summary: Update build plugins, avoid plugin version warning, centralize versions Key: SPARK-1387 URL: https://issues.apache.org/jira/browse/SPARK-1387 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Priority: Minor

Another handful of small build changes to organize and standardize a bit, and avoid warnings:
- Update Maven plugin versions for good measure
- Since plugins need maven 3.0.4 already, require it explicitly (3.0.4 had some bugs anyway)
- Use variables to define versions across dependencies where they should move in lock step
- ... and make this consistent between Maven/SBT

-- This message was sent by Atlassian JIRA (v6.2#6252)
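[Editor's sketch] To illustrate the "use variables to define versions" point on the sbt side, a hedged example follows; the artifact names and version strings are placeholders chosen for the example (the Maven side would use a <properties> entry referenced as a ${...} property), not the actual SparkBuild.scala change.
{code}
// Illustrative only: pin versions that must move in lock step in one place,
// then reference them from every dependency that shares that version.
val jettyVersion = "8.1.14.v20131031"      // placeholder version strings
val akkaVersion  = "2.2.3-shaded-protobuf"

libraryDependencies ++= Seq(
  "org.eclipse.jetty"      % "jetty-server" % jettyVersion,
  "org.eclipse.jetty"      % "jetty-util"   % jettyVersion,
  "org.spark-project.akka" %% "akka-remote" % akkaVersion,
  "org.spark-project.akka" %% "akka-slf4j"  % akkaVersion
)
{code}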
[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS
[ https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957113#comment-13957113 ] Sean Owen commented on SPARK-1355: -- April Fools, apparently. Though this was opened on 30 March? Switch website to the Apache CMS Key: SPARK-1355 URL: https://issues.apache.org/jira/browse/SPARK-1355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Joe Schaefer Jekyll is ancient history useful for small blogger sites and little else. Why not upgrade to the Apache CMS? It supports the same on-disk format for .md files and interfaces with pygments for code highlighting. Thrift recently switched from nanoc to the CMS and loves it! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark
[ https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957330#comment-13957330 ] Sean Owen commented on SPARK-1388: -- Yes this should be resolved as a duplicate instead. ConcurrentModificationException in hadoop_common exposed by Spark - Key: SPARK-1388 URL: https://issues.apache.org/jira/browse/SPARK-1388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Nishkam Ravi Attachments: nravi_Conf_Spark-1388.patch The following exception occurs non-deterministically: java.util.ConcurrentModificationException at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926) at java.util.HashMap$KeyIterator.next(HashMap.java:960) at java.util.AbstractCollection.addAll(AbstractCollection.java:341) at java.util.HashSet.init(HashSet.java:117) at org.apache.hadoop.conf.Configuration.init(Configuration.java:671) at org.apache.hadoop.mapred.JobConf.init(JobConf.java:439) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:154) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957450#comment-13957450 ] Sean Owen commented on SPARK-1391: -- Oops yes I mean offset of course. Good investigation there. I am also not sure why the index would not show. BlockManager cannot transfer blocks larger than 2G in size -- Key: SPARK-1391 URL: https://issues.apache.org/jira/browse/SPARK-1391 Project: Spark Issue Type: Bug Components: Block Manager, Shuffle Affects Versions: 1.0.0 Reporter: Shivaram Venkataraman If a task tries to remotely access a cached RDD block, I get an exception when the block size is 2G. The exception is pasted below. Memory capacities are huge these days ( 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is 2G. 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message java.lang.ArrayIndexOutOfBoundsException at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661) at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957949#comment-13957949 ] Sean Owen commented on SPARK-1391: -- Of course! Hardly my issue. Well, you could try my patch that replaces fastutil with alternatives. I doubt the standard ByteArrayOutputStream does differently though? But we are always going to have a problem in that a Java byte array can only be so big because of the size of an int, regardless of stream position issues. This one could be deeper. BlockManager cannot transfer blocks larger than 2G in size -- Key: SPARK-1391 URL: https://issues.apache.org/jira/browse/SPARK-1391 Project: Spark Issue Type: Bug Components: Block Manager, Shuffle Affects Versions: 1.0.0 Reporter: Shivaram Venkataraman If a task tries to remotely access a cached RDD block, I get an exception when the block size is 2G. The exception is pasted below. Memory capacities are huge these days ( 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is 2G. 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message java.lang.ArrayIndexOutOfBoundsException at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661) at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
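[Editor's sketch] As a quick sanity check on the ceiling discussed in the comments above, the arithmetic behind the 2G limit is below; nothing Spark-specific, just the consequence of JVM arrays being indexed by Int.
{code}
// A JVM byte array can hold at most Int.MaxValue elements, so any buffer backed
// by a single Array[Byte] tops out just under 2 GiB, regardless of how the
// stream tracks its write position.
val maxArrayBytes: Long = Int.MaxValue                    // 2147483647
val maxArrayGiB: Double = maxArrayBytes / math.pow(2, 30) // ~2.0
println(f"largest single Array[Byte]: $maxArrayBytes bytes = $maxArrayGiB%.3f GiB")
{code}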
[jira] [Commented] (SPARK-1441) Compile Spark Core error with Hadoop 0.23.x
[ https://issues.apache.org/jira/browse/SPARK-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13963016#comment-13963016 ] Sean Owen commented on SPARK-1441: -- I see, I had taken the yarn-alpha profile to be the slightly-misnamed profile you would use when building the whole project with 0.23, and building this distro means building everything. At least, that's a fairly fine solution, no? you should set yarn.version to 0.23.9 too. Compile Spark Core error with Hadoop 0.23.x --- Key: SPARK-1441 URL: https://issues.apache.org/jira/browse/SPARK-1441 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: witgo Attachments: mvn.log, sbt.log -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1357) [MLLIB] Annotate developer and experimental API's
[ https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13963999#comment-13963999 ] Sean Owen commented on SPARK-1357: -- I know I'm late to this party, but I just had a look and wanted to throw out a few last minute ideas. (Do you not want to just declare all of MLlib experimental? is it really 1.0? that's a fairly significant set of shackles to put on for a long time.) OK, that aside, I have two suggestions to mark as experimental: 1. ALS Rating object assumes users and items are Int. I suggest it will be eventually interesting to support String, or at least switch to Long. 2. Per old MLLIB-29, I feel pretty certain that ClassificationModel can't return RDD[Double], and will want to support returning a distribution over labels at some point. Similarly the input to it and RegressionModel seems like it will have to change to encompass something more than Vector to properly allow for categorical values. DecisionTreeModel has the same issue but is experimental (and doesn't integrate with these APIs?) The point is not so much whether one agrees with these, but whether there is a non-trivial chance of wanting to change something this year. Other parts that I'm interested in personally look pretty strong. Humbly submitted. [MLLIB] Annotate developer and experimental API's - Key: SPARK-1357 URL: https://issues.apache.org/jira/browse/SPARK-1357 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Xiangrui Meng Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
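[Editor's sketch] To make point 2 above concrete, here is a rough sketch of the API shape difference at stake. Both traits are illustrative stand-ins rather than the actual MLlib interfaces; only the Vector and RDD types are the real Spark classes mentioned in the comment.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// Today's shape: one Double (a predicted label) per input example.
trait HardClassificationModel {
  def predict(testData: RDD[Vector]): RDD[Double]
}

// The shape the comment anticipates wanting later: a distribution over labels,
// which cannot be retrofitted onto RDD[Double] without breaking the signature.
trait SoftClassificationModel {
  def predictProbabilities(testData: RDD[Vector]): RDD[Array[Double]]
}
{code}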
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964638#comment-13964638 ] Sean Owen commented on SPARK-1406: -- Yes I understand transformations can be described in PMML. Do you mean parsing a transformation described in PMML and implementing the transformation? Yes, that goes hand in hand with supporting import of a model in general. I would merely suggest this is a step that comes after several others in order of priority, like:
- implementing feature transformations in the abstract in the code base, separately from the idea of PMML
- implementing some form of model import via JPMML
- implementing more functionality in the Model classes, to give a reason to want to import an external model into MLlib
... and to me this is less useful at this point than export too. I say this because the power of MLlib/Spark right now is perceived to be model building, making it more producer than consumer at this stage. PMML model evaluation support via MLib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont It would be useful if Spark provided support for the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with a statistical modeling tool like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing. https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1357) [MLLIB] Annotate developer and experimental API's
[ https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964649#comment-13964649 ] Sean Owen commented on SPARK-1357: -- Yeah I think it's reasonable to say that the core ALS API is only in terms of numeric IDs and leave a higher-level translation to the caller. Longs give that much more space to hash into. The cost in terms of memory of something like a String is just a reference, so roughly the same as a Double anyway. I think the more important question is whether Double is too hacky API-wise as a representation of fundamentally non-numeric data. That's up for debate, but yeah the question here is more about reserving the right to change. I'll submit a PR that marks the items I mention as experimental, for consideration. See if it seems reasonable. [MLLIB] Annotate developer and experimental API's - Key: SPARK-1357 URL: https://issues.apache.org/jira/browse/SPARK-1357 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Xiangrui Meng Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
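[Editor's sketch] A tiny example of what "leave a higher-level translation to the caller" could look like if IDs were widened to Long; the hash function here is an arbitrary 64-bit polynomial hash used purely for illustration, and a real caller would also keep a reverse map to recover the original String IDs.
{code}
// Hypothetical helper: fold a String user/item ID into the Long space.
// Collisions are possible but far less likely than when hashing into an Int.
def idToLong(id: String): Long =
  id.foldLeft(1125899906842597L)((h, c) => 31 * h + c)

val userId  = idToLong("user:sean@example.org")
val movieId = idToLong("item:the-empire-strikes-back")
{code}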
[jira] [Commented] (SPARK-1437) Jenkins should build with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967522#comment-13967522 ] Sean Owen commented on SPARK-1437: -- Pardon my boldness in pushing this onto your plate pwendell, but might be a very quick fix in Jenkins. If Travis CI is going to be activated, it can definitely be configured to build with Java 6 and 7 both. Jenkins should build with Java 6 Key: SPARK-1437 URL: https://issues.apache.org/jira/browse/SPARK-1437 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Assignee: Patrick Wendell Priority: Minor Labels: javac, jenkins Attachments: Screen Shot 2014-04-07 at 22.53.56.png Apologies if this was already on someone's to-do list, but I wanted to track this, as it bit two commits in the last few weeks. Spark is intended to work with Java 6, and so compiles with source/target 1.6. Java 7 can correctly enforce Java 6 language rules and emit Java 6 bytecode. However, unless otherwise configured with -bootclasspath, javac will use its own (Java 7) library classes. This means code that uses classes in Java 7 will be allowed to compile, but the result will fail when run on Java 6. This is why you get warnings like ... Using /usr/java/jdk1.7.0_51 as default JAVA_HOME. ... [warn] warning: [options] bootstrap class path not set in conjunction with -source 1.6 The solution is just to tell Jenkins to use Java 6. This may be stating the obvious, but it should just be a setting under Configure for SparkPullRequestBuilder. In our Jenkinses, JDK 6/7/8 are set up; if it's not an option already I'm guessing it's not too hard to get Java 6 configured on the Amplab machines. -- This message was sent by Atlassian JIRA (v6.2#6252)
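[Editor's sketch] A minimal illustration of the failure mode described above, as a hypothetical snippet rather than anything in the Spark source: java.nio.file.Files exists only in the Java 7 class library, so a build running on JDK 7 accepts the reference even when targeting Java 6, and running the result on a Java 6 JRE fails with NoClassDefFoundError.
{code}
import java.nio.file.{Files, Paths}

object Java7OnlyCall {
  def main(args: Array[String]): Unit = {
    // Compiles against JDK 7's library; no such class exists on a Java 6 runtime.
    println(Files.exists(Paths.get("/tmp")))
  }
}
{code}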
[jira] [Commented] (SPARK-1479) building spark on 2.0.0-cdh4.4.0 failed
[ https://issues.apache.org/jira/browse/SPARK-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967784#comment-13967784 ] Sean Owen commented on SPARK-1479: -- yarn is the slightly more appropriate profile, but, read: https://github.com/apache/spark/pull/151 What Spark doesn't quite support is YARN beta and that's what you've got on your hands here. FWIW I am in favor of the change in this PR to make it all work. Soon, support for YARN alpha/beta can just go away anyway. If you are interested in CDH, the best thing is moving to CDH5, which already has Spark set up in standalone mode, and which has YARN stable. It also works with CDH 4.6 in standalone mode as a parcel. building spark on 2.0.0-cdh4.4.0 failed --- Key: SPARK-1479 URL: https://issues.apache.org/jira/browse/SPARK-1479 Project: Spark Issue Type: Question Environment: 2.0.0-cdh4.4.0 Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL spark 0.9.1 java version 1.6.0_32 Reporter: jackielihf Attachments: mvn.log [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. CompileFailed - [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59) at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161) at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320) at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156) at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537) at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196) at org.apache.maven.cli.MavenCli.main(MavenCli.java:141) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352) Caused by: org.apache.maven.plugin.PluginExecutionException: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:110) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209) ... 
19 more Caused by: Compilation failed at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:76) at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:35) at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:29) at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:71) at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71) at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71) at sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:101) at sbt.compiler.AggressiveCompile$$anonfun$4.compileScala$1(AggressiveCompile.scala:70) at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:88) at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:60)
[jira] [Created] (SPARK-1488) Resolve scalac feature warnings during build
Sean Owen created SPARK-1488: Summary: Resolve scalac feature warnings during build Key: SPARK-1488 URL: https://issues.apache.org/jira/browse/SPARK-1488 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Priority: Minor

For your consideration: scalac currently notes a number of feature warnings during compilation:
{code}
[warn] there were 65 feature warning(s); re-run with -feature for details
{code}
Warnings are like:
{code}
[warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scala docs for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn] implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
[warn] ^
{code}
scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used. The accompanying PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.

-- This message was sent by Atlassian JIRA (v6.2#6252)
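[Editor's sketch] For reference, a self-contained example of the kind of change the warning asks for; the names are made up for the example rather than taken from SparkContext.
{code}
import scala.language.implicitConversions   // silences the feature warning

object FeatureWarningExample {
  class Doubler(val n: Int) { def doubled: Int = 2 * n }

  // Without the import above (or -language:implicitConversions), scalac 2.10
  // reports the "implicit conversion method ... should be enabled" warning.
  implicit def intToDoubler(n: Int): Doubler = new Doubler(n)

  def main(args: Array[String]): Unit = println(3.doubled)   // prints 6
}
{code}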
[jira] [Commented] (SPARK-1439) Aggregate Scaladocs across projects
[ https://issues.apache.org/jira/browse/SPARK-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971662#comment-13971662 ] Sean Owen commented on SPARK-1439: -- I had a run at this today. First I tried Maven-based formulas, but they didn't quite do the trick. I made some progress with unidoc, although not all the way. Maybe an SBT expert can help me figure out how to finish it.

*Maven*
http://stackoverflow.com/questions/12301620/how-to-generate-an-aggregated-scaladoc-for-a-maven-site
This works, but, generates *javadoc* for everything, including Scala source. The resulting javadoc is not so helpful. It also complains a lot about not finding references since javadoc doesn't quite understand links in the same way.

*Maven #2*
You can also invoke the scala-maven-plugin 'doc' goal as part of the site generation:
{code:xml}
<reporting>
  <plugins>
    ...
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <reportSets>
        <reportSet>
          <reports>
            <report>doc</report>
          </reports>
        </reportSet>
      </reportSets>
    </plugin>
  </plugins>
</reporting>
{code}
It lacks a goal like aggregate that the javadoc plugin has, which takes care of combining everything into one set of docs. This only generates scaladoc in each module in exploded format.

*Unidoc / SBT*
It is almost as easy as:
- adding the plugin to plugins.sbt: {{addSbtPlugin("com.eed3si9n" % "sbt-unidoc" % "0.3.0")}}
- {{import sbtunidoc.Plugin._}} and {{UnidocKeys._}} in SparkBuild.scala
- adding ++ unidocSettings to rootSettings in SparkBuild.scala
but it was also necessary to:
- set {{SPARK_YARN=true}} and {{SPARK_HADOOP_VERSION=2.2.0}}, for example, to make YARN scaladoc work
- exclude {{yarn-alpha}} since scaladoc doesn't like the collision of class names:
{code}
def rootSettings = sharedSettings ++ unidocSettings ++ Seq(
  unidocProjectFilter in (ScalaUnidoc, unidoc) := inAnyProject -- inProjects(yarnAlpha),
  publish := {}
)
{code}
I still get SBT errors since I think this is not quite correctly finessing the build. But it seems almost there.

Aggregate Scaladocs across projects --- Key: SPARK-1439 URL: https://issues.apache.org/jira/browse/SPARK-1439 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Matei Zaharia Fix For: 1.0.0 Apparently there's a Unidoc plugin to put together ScalaDocs across modules: https://github.com/akka/akka/blob/master/project/Unidoc.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1464) Update MLLib Examples to Use Breeze
[ https://issues.apache.org/jira/browse/SPARK-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972342#comment-13972342 ] Sean Owen commented on SPARK-1464: -- This is a duplicate of https://issues.apache.org/jira/browse/SPARK-1462 which is resolved now. Update MLLib Examples to Use Breeze --- Key: SPARK-1464 URL: https://issues.apache.org/jira/browse/SPARK-1464 Project: Spark Issue Type: Task Components: MLlib Reporter: Patrick Wendell Assignee: Xiangrui Meng Priority: Blocker Fix For: 1.0.0 If we want to deprecate the vector class we need to update all of the examples to use Breeze. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972397#comment-13972397 ] Sean Owen commented on SPARK-1520: -- Madness. One wild guess is that the breeze .jar files have something in META-INF that, when merged together into the assembly jar, conflicts with other META-INF items. In particular I'm thinking of MANIFEST.MF entries. It's worth diffing those if you can from before and after. However this would still require that Java 7 and 6 behave differently with respect to the entries, to explain your findings. It's possible. Your last comment however suggests it's something strange with the byte code that gets output for a few classes. Java 7 is stricter about byte code. For example: https://weblogs.java.net/blog/fabriziogiudici/archive/2012/05/07/understanding-subtle-new-behaviours-jdk-7 However I would think these would manifest as quite different errors. What about running with -verbose:class to print classloading messages? it might point directly to what's failing to load, if that's it. Of course you can always build with Java 6 since that's supposed to be all that's supported/required now (see my other JIRA about making Jenkins do this), although I agree that it would be nice to get to the bottom of this, as there is no obvious reason this shouldn't work. Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6 - Key: SPARK-1520 URL: https://issues.apache.org/jira/browse/SPARK-1520 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 This is a real doozie - when compiling a Spark assembly with JDK7, the produced jar does not work well with JRE6. I confirmed the byte code being produced is JDK 6 compatible (major version 50). What happens is that, silently, the JRE will not load any class files from the assembled jar. {code} $ sbt/sbt assembly/assembly $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] [FIFO|FAIR] $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/ui/UIWorkloadGenerator Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.UIWorkloadGenerator at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:323) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:268) Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. Program will exit. {code} I also noticed that if the jar is unzipped, and the classpath set to the currently directory, it just works. Finally, if the assembly jar is compiled with JDK6, it also works. The error is seen with any class, not just the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only in master. 
*Isolation* -I ran a git bisection and this appeared after the MLLib sparse vector patch was merged:- https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534 SPARK-1212 -I narrowed this down specifically to the inclusion of the breeze library. Just adding breeze to an older (unaffected) build triggered the issue.- I've found that if I just unpack and re-pack the jar, it sometimes works: {code} $ cd assembly/target/scala-2.10/ $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar * $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972523#comment-13972523 ] Sean Owen commented on SPARK-1520: -- Regarding large numbers of files: are there INDEX.LST files used anywhere in the jars? If this gets munged or truncated while building the assembly jar, that might cause all kinds of havoc. It could be omitted. http://docs.oracle.com/javase/7/docs/technotes/guides/jar/jar.html#Index_File_Specification Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6 - Key: SPARK-1520 URL: https://issues.apache.org/jira/browse/SPARK-1520 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 This is a real doozie - when compiling a Spark assembly with JDK7, the produced jar does not work well with JRE6. I confirmed the byte code being produced is JDK 6 compatible (major version 50). What happens is that, silently, the JRE will not load any class files from the assembled jar. {code} $ sbt/sbt assembly/assembly $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] [FIFO|FAIR] $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/ui/UIWorkloadGenerator Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.UIWorkloadGenerator at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:323) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:268) Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. Program will exit. {code} I also noticed that if the jar is unzipped, and the classpath set to the currently directory, it just works. Finally, if the assembly jar is compiled with JDK6, it also works. The error is seen with any class, not just the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only in master. *Isolation* -I ran a git bisection and this appeared after the MLLib sparse vector patch was merged:- https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534 SPARK-1212 -I narrowed this down specifically to the inclusion of the breeze library. Just adding breeze to an older (unaffected) build triggered the issue.- I've found that if I just unpack and re-pack the jar (using `jar` from java 6 or 7) it always works: {code} $ cd assembly/target/scala-2.10/ $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar * $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds {code} I also noticed something of note. The Breeze package contains single directories that have huge numbers of files in them (e.g. 
2000+ class files in one directory). It's possible we are hitting some weird bugs/corner cases with compatibility of the internal storage format of the jar itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
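[Editor's sketch] One way to check both hypotheses above (signing entries in META-INF, and a stray INDEX.LST) from the JVM itself; this is a small illustrative utility, not anything in the Spark build, and it simply lists the META-INF entries of whatever jar path it is given.
{code}
import java.util.jar.JarFile
import scala.collection.JavaConverters._

object ListMetaInf {
  def main(args: Array[String]): Unit = {
    val jar = new JarFile(args(0))   // e.g. the assembly jar
    try {
      jar.entries().asScala
        .filter(_.getName.startsWith("META-INF/"))
        .foreach(e => println(s"${e.getName}\t${e.getSize} bytes"))
      // Signature files (*.SF, *.RSA/*.DSA), MANIFEST.MF and any INDEX.LST all
      // live under META-INF/, so diffing this output before and after assembly
      // shows what the merge actually did.
    } finally jar.close()
  }
}
{code}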
[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1
[ https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973067#comment-13973067 ] Sean Owen commented on SPARK-1527: -- {{toString()}} returns {{getPath()}} which may still be relative. {{getAbsolutePath()}} is better, but even {{getCanonicalPath()}} may be better still. rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1 --- Key: SPARK-1527 URL: https://issues.apache.org/jira/browse/SPARK-1527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 24h Remaining Estimate: 24h In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala val rootDir0 = Files.createTempDir() rootDir0.deleteOnExit() val rootDir1 = Files.createTempDir() rootDir1.deleteOnExit() val rootDirs = rootDir0.getName + , + rootDir1.getName rootDir0 and rootDir1 are in system's temporary directory. rootDir0.getName will not get the full path of the directory but the last component of the directory. When passing to DiskBlockManage constructor, the DiskBlockerManger creates directories in pwd not the temporary directory. rootDir0.toString will fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1
[ https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973091#comment-13973091 ] Sean Owen commented on SPARK-1527: -- If the paths are only used locally, then an absolute path never hurts (except to be a bit longer). I assume that since these are references to a temp directory that is by definition only valid locally, that absolute path is the right thing to use. In other cases, similar logic may apply. I could imagine in some cases the right thing to do is transmit a relative path. rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1 --- Key: SPARK-1527 URL: https://issues.apache.org/jira/browse/SPARK-1527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 24h Remaining Estimate: 24h In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala val rootDir0 = Files.createTempDir() rootDir0.deleteOnExit() val rootDir1 = Files.createTempDir() rootDir1.deleteOnExit() val rootDirs = rootDir0.getName + , + rootDir1.getName rootDir0 and rootDir1 are in system's temporary directory. rootDir0.getName will not get the full path of the directory but the last component of the directory. When passing to DiskBlockManage constructor, the DiskBlockerManger creates directories in pwd not the temporary directory. rootDir0.toString will fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1
[ https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973148#comment-13973148 ] Sean Owen commented on SPARK-1527: -- There are a number of other uses of File.getName(), but a quick glance suggests all the others are appropriate. There are a number of other uses of File.toString(), almost all in tests. I suspect the Files in question already have absolute paths, and that even relative paths happen to work fine in a test since the working dir doesn't change. So those could change, but are probably not a concern. The only one that gave me pause was the use in HttpBroadcast.scala, though I suspect it turns out to work fine for similar reasons. If reviewers are interested in changing the toString()s I'll test and submit a PR for that. rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1 --- Key: SPARK-1527 URL: https://issues.apache.org/jira/browse/SPARK-1527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 24h Remaining Estimate: 24h In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala val rootDir0 = Files.createTempDir() rootDir0.deleteOnExit() val rootDir1 = Files.createTempDir() rootDir1.deleteOnExit() val rootDirs = rootDir0.getName + , + rootDir1.getName rootDir0 and rootDir1 are in system's temporary directory. rootDir0.getName will not get the full path of the directory but the last component of the directory. When passing to DiskBlockManage constructor, the DiskBlockerManger creates directories in pwd not the temporary directory. rootDir0.toString will fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
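[Editor's sketch] A quick demonstration of the distinction drawn in these comments, using the same Guava Files.createTempDir() call as the test under discussion; the paths shown in comments are examples only.
{code}
import com.google.common.io.Files

object TempDirPaths {
  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDir()
    dir.deleteOnExit()
    println(dir.getName)           // last component only, e.g. "1398712345678-0"
    println(dir.toString)          // same as getPath(); not guaranteed to be absolute
    println(dir.getAbsolutePath)   // e.g. "/tmp/1398712345678-0"
    println(dir.getCanonicalPath)  // absolute, with symlinks and ".." resolved
  }
}
{code}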
[jira] [Commented] (SPARK-1556) jets3t dependency is outdated
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13975957#comment-13975957 ] Sean Owen commented on SPARK-1556: -- Actually, why does Spark have a direct dependency on jets3t at all? it is not used directly in the code. If it's only needed at runtime, it can/should be declared that way. But if the reason it's there is just for Hadoop, then of course hadoop-client is already bringing it in, and should be allowed to bring in the version it wants. jets3t dependency is outdated - Key: SPARK-1556 URL: https://issues.apache.org/jira/browse/SPARK-1556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Reporter: Nan Zhu Assignee: Nan Zhu Fix For: 1.0.0 In Hadoop 2.2.x or newer, Jet3st 0.9.0 which defines S3ServiceException/ServiceException is introduced, however, Spark still relies on Jet3st 0.7.x which has no definition of these classes What I met is that [code] 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.SparkContext.runJob(SparkContext.scala:891) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692) at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609) at
[jira] [Commented] (SPARK-1556) jets3t dependency is outdated
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13975967#comment-13975967 ] Sean Owen commented on SPARK-1556: -- OK, I partly eat my words. jets3t isn't included by the Hadoop client library it appears. It's only included by the Hadoop server-side components. So yeah Spark has to include jets3t to make s3:// URLs work in the REPL. FWIW I agree with updating the version -- ideally just in the Hadoop 2.2+ profiles. And it should be scoperuntime/scope jets3t dependency is outdated - Key: SPARK-1556 URL: https://issues.apache.org/jira/browse/SPARK-1556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Reporter: Nan Zhu Assignee: Nan Zhu Fix For: 1.0.0 In Hadoop 2.2.x or newer, Jet3st 0.9.0 which defines S3ServiceException/ServiceException is introduced, however, Spark still relies on Jet3st 0.7.x which has no definition of these classes What I met is that [code] 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.SparkContext.runJob(SparkContext.scala:891) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692) at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609) at
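[Editor's sketch] The runtime-scope suggestion above, sketched in sbt terms; the Maven change would be the equivalent <scope>runtime</scope> on the jets3t dependency, and the 0.9.0 version shown matches the Hadoop 2.2+ case discussed here rather than being a tested pin.
{code}
// Needed only at runtime (for s3:// URLs handled by Hadoop's NativeS3FileSystem),
// so it is kept off the compile classpath.
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.0" % "runtime"
{code}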
[jira] [Commented] (SPARK-1605) Improve mllib.linalg.Vector
[ https://issues.apache.org/jira/browse/SPARK-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13979380#comment-13979380 ] Sean Owen commented on SPARK-1605: -- I think this was on purpose, to try to hide breeze as an implementation detail, at least in public APIs? Improve mllib.linalg.Vector --- Key: SPARK-1605 URL: https://issues.apache.org/jira/browse/SPARK-1605 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sandeep Singh We can make current Vector a wrapper around Breeze.linalg.Vector ? I want to work on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
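[Editor's sketch] A minimal sketch of the "hide breeze as an implementation detail" pattern the comment refers to; the package and class names are invented for the example and are not the actual MLlib internals.
{code}
package example.mllib   // hypothetical package, so private[mllib] has something to refer to

import breeze.linalg.{DenseVector => BDV, Vector => BV}

trait Vector extends Serializable {
  def toArray: Array[Double]
  private[mllib] def toBreeze: BV[Double]   // breeze never appears in the public API
}

class DenseVector(val values: Array[Double]) extends Vector {
  override def toArray: Array[Double] = values
  override private[mllib] def toBreeze: BV[Double] = new BDV[Double](values)
}
{code}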
[jira] [Commented] (SPARK-1629) Spark Core missing commons-lang dependence
[ https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980878#comment-13980878 ] Sean Owen commented on SPARK-1629: -- I don't see any usage of Commons Lang in the whole project? Tachyon uses commons-lang3 but it also brings it in as a dependency. Spark Core missing commons-lang dependence --- Key: SPARK-1629 URL: https://issues.apache.org/jira/browse/SPARK-1629 Project: Spark Issue Type: Bug Reporter: witgo -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1638) Executors fail to come up if spark.executor.extraJavaOptions is set
[ https://issues.apache.org/jira/browse/SPARK-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983156#comment-13983156 ] Sean Owen commented on SPARK-1638: -- Almost certainly a duplicate of https://issues.apache.org/jira/browse/SPARK-1609 Executors fail to come up if spark.executor.extraJavaOptions is set -- Key: SPARK-1638 URL: https://issues.apache.org/jira/browse/SPARK-1638 Project: Spark Issue Type: Bug Components: Spark Core Environment: Bring up a cluster in EC2 using spark-ec2 scripts Reporter: Kalpit Shah Fix For: 1.0.0 If you try to launch a PySpark shell with spark.executor.extraJavaOptions set to -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, the executors never come up on any of the workers. I see the following error in log file : Spark Executor Command: /usr/lib/jvm/java/bin/java -cp /root/c3/lib/*::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar: -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xms13312M -Xmx13312M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@HOSTNAME:45429/user/CoarseGrainedScheduler 7 HOSTNAME 4 akka.tcp://sparkWorker@HOSTNAME:39727/user/Worker app-20140423224526- Unrecognized VM option 'UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
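[Editor's sketch] One plausible reading of the quoted error, sketched for illustration (the actual fix lives in the duplicate issue): the whole spark.executor.extraJavaOptions string is handed to the JVM as a single argument instead of one flag per argument, which is what "Unrecognized VM option" followed by the rest of the string suggests.
{code}
val extraJavaOpts = "-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails"

// Wrong: one argument containing spaces; the JVM sees a single unknown option.
val brokenArgs  = Seq(extraJavaOpts)

// Right: split on whitespace so each flag becomes its own argument.
val workingArgs = extraJavaOpts.split("\\s+").toSeq
{code}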
[jira] [Created] (SPARK-1663) Spark Streaming docs code has several small errors
Sean Owen created SPARK-1663: Summary: Spark Streaming docs code has several small errors Key: SPARK-1663 URL: https://issues.apache.org/jira/browse/SPARK-1663 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Minor The changes are easiest to elaborate in the PR, which I will open shortly. Those changes raised a few little questions about the API too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1663) Spark Streaming docs code has several small errors
[ https://issues.apache.org/jira/browse/SPARK-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984466#comment-13984466 ] Sean Owen commented on SPARK-1663: -- PR: https://github.com/apache/spark/pull/589 Spark Streaming docs code has several small errors -- Key: SPARK-1663 URL: https://issues.apache.org/jira/browse/SPARK-1663 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Minor Labels: streaming The changes are easiest to elaborate in the PR, which I will open shortly. Those changes raised a few little questions about the API too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985770#comment-13985770 ] Sean Owen commented on SPARK-1684: -- (Can I be pedantic and suggest standardizing to SPARK-XXX ? this is how issues are reported in other projects, like HIVE-123 and MAPREDUCE-234) Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Fix For: 1.1.0 If users write [SPARK-XXX] Issue or SPARK-XXX. Issue or SPARK XXX: Issue we should convert it to SPARK XXX: Issue -- This message was sent by Atlassian JIRA (v6.2#6252)
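The merge script itself is Python, but the normalization rule under discussion is small enough to sketch; here it is in Scala, standardized on the hyphenated SPARK-XXX form suggested in the comment above (illustrative only, not the script's actual code):
{code}
// Normalize "[SPARK-1684] Title", "SPARK-1684. Title" or "SPARK 1684: Title"
// to "SPARK-1684: Title"; anything without a recognizable prefix is left alone.
def normalizeTitle(title: String): String = {
  val Prefix = """(?i)^\[?SPARK[ -](\d+)\]?[:.]?\s*(.*)$""".r
  title.trim match {
    case Prefix(num, rest) => s"SPARK-$num: $rest"
    case other             => other
  }
}
{code}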
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986471#comment-13986471 ] Sean Owen commented on SPARK-1693: -- I suspect this occurs because two copies of the servlet API jars are included from two sources, and one of those sources includes jar signing information in its manifest. The resulting merged jar has collisions in the signing information and it no longer matches. If that's right, the fastest way to avoid this is usually to drop signing information that is in the manifest since it is not helpful in the assembly jar. Of course it's ideal to avoid merging two copies of the same dependencies, since only one can be included, and that's why we see some [warn] in the build. In just about all cases it is harmless since they are actually copies of the same version of the same classes. I will look into what ends up in the manifest. Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
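As a minimal sketch of the "drop signing information" idea, assuming the sbt-assembly plugin (the Maven build could do the equivalent with shade-plugin filters), the merge can simply discard META-INF signature files:
{code}
// Sketch only: discard jar-signing metadata (*.SF / *.DSA / *.RSA under
// META-INF) while merging jars into the assembly, so stale signature entries
// can no longer mismatch the merged classes. Key names vary by plugin version.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*)
      if xs.lastOption.exists(n =>
        n.endsWith(".SF") || n.endsWith(".DSA") || n.endsWith(".RSA")) =>
    MergeStrategy.discard
  case other =>
    MergeStrategy.defaultMergeStrategy(other)
}
{code}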
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986490#comment-13986490 ] Sean Owen commented on SPARK-1693: -- I think this is traceable to a case of jar conflict. I am not sure whether the ultimate cause is signing, but it doesn't matter, since we should simply resolve the conflict. (But I think something like that may be at play, since one of the problem dependencies is from Eclipse's jetty, and there is an Eclipse cert in the manifest at META-INF/ECLIPSEF.RSA...) Anyway. This is another fun jar hell puzzler, albeit one with a probable solution. The basic issue is that Jetty brings in the Servlet 3.0 API: {code} [INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile [INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile {code} ... but in Hadoop 2.3.0+, so does Hadoop client: {code} [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.4.0:compile ... [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.4.0:compile [INFO] | | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.4.0:compile [INFO] | | | +- javax.xml.bind:jaxb-api:jar:2.2.2:compile ... [INFO] | | | +- javax.servlet:servlet-api:jar:2.5:compile {code} Eclipse is naughty for packaging the same API classes in a different artifact, rather than just using javax.servlet:servlet-api 3.0. There may be a reason for that, which is what worries me. In theory Servlet 3.0 is a superset of 2.5, so Hadoop's client should be happy with the 3.0 API on the classpath. So solution #1 to try is to exclude javax.servlet:servlet-api from the hadoop-client dependency. It won't affect earlier versions at all since they don't have this dependency, and therefore need not even be version-specific. Hadoop 2.3+ then in theory should happily find Eclipse's Servlet 3.0 API classes and work as normal. I give that about a 90% chance of being true. Solution #2 is more theoretically sound. We should really exclude Jetty's custom copy of Servlet 3.0, and depend on javax.servlet:servlet-api:jar:3.0 as a runtime dependency. This should transparently override Hadoop 2.3+'s version, and still work for Hadoop. Messing with Jetty increases the chance of another snag. Probability of success: 80%. Li, would you be able to try either of those ideas? 
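A minimal sketch of solution #1 in SBT form (the Maven pom would use an equivalent exclusion on hadoop-client); the coordinates and version are the ones from the dependency tree above:
{code}
// Solution #1, sketched: drop Hadoop's Servlet 2.5 API so that Jetty's
// Servlet 3.0 API is the only copy of those classes on the classpath.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0" exclude("javax.servlet", "servlet-api")
{code}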
Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986507#comment-13986507 ] Sean Owen commented on SPARK-1693: -- Correct. I thought you had tried option #1 since you had suggested it too. Can you not test it in the same way that you discovered the problem? If you try, I'd suggest the #2 option, because if that works, it is a more robust solution. Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986523#comment-13986523 ] Sean Owen commented on SPARK-1693: -- No, that's exactly the problem as far as I can tell. My suggestion is: - In core, exclude org.eclipse.jetty.orbit:javax.servlet from org.eclipse.jetty:jetty-server - Declare a runtime-scope dependency in core, on javax.servlet:servlet-api:jar:3.0 - (And that will happen to override Hadoop's javax.servlet:servlet-api:jar:2.5 !) - Update the SBT build as closely as possible to match Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
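Those steps, sketched in SBT form with the coordinates named above (the Maven pom would need the matching exclusion and runtime-scope dependency):
{code}
// Solution #2, sketched: exclude Jetty's Orbit repackaging of the Servlet 3.0
// API and declare javax.servlet:servlet-api 3.0 at runtime scope instead,
// which also happens to override Hadoop's Servlet 2.5 dependency.
libraryDependencies ++= Seq(
  "org.eclipse.jetty" % "jetty-server" % "8.1.14.v20131031"
    exclude("org.eclipse.jetty.orbit", "javax.servlet"),
  "javax.servlet" % "servlet-api" % "3.0" % "runtime"
)
{code}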
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986631#comment-13986631 ] Sean Owen commented on SPARK-1693: -- Yes looks very close to what I had in mind; I have two suggestions: - To be ultra-safe, use version 3.0.0 of the Servlet API not 3.0.1 - Maybe drop a comment in the parent pom about why this dependency exists -- even just a reference to this JIRA Does it work for you then? fingers crossed. Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13987686#comment-13987686 ] Sean Owen commented on SPARK-1698: -- (Copying an earlier comment that went to the mailing list, but didn't make it here:) #1 and #2 are not relevant to the issue of jar size. These can be problems in general, but I don't think there have been issues attributable to file clashes. Shading has mechanisms to deal with this anyway. #3 is a problem in general too, but is not specific to shading. Where versions collide, build processes like Maven and shading must be used to resolve them. But this happens regardless of whether you shade a fat jar. #4 is a real problem specific to Java 6. It does seem like it will be important to identify and remove more unnecessary dependencies to work around it. But shading per se is not the problem, and it is important to make a packaged jar for the app. What are you proposing? Dependencies to be removed? Improve spark integration - Key: SPARK-1698 URL: https://issues.apache.org/jira/browse/SPARK-1698 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.0.0 Using the shade plugin to create a big JAR with all the dependencies can cause a few problems: 1. The jars' meta information goes missing 2. Some files get overwritten, e.g. plugin.xml 3. Different versions of the same jar may co-exist 4. Too big; Java 6 does not support it -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13987698#comment-13987698 ] Sean Owen commented on SPARK-1698: -- What is the suggested change in this particular JIRA? I saw the PR, which seems to replace the shade plugin with the assembly plugin. Given the reference to https://issues.scala-lang.org/browse/SI-6660, are you suggesting that your assembly change packages differently, by putting jars in jars? Yes, the issue you link to is exactly the kind of problem that can occur with this approach. It comes up a bit in Hadoop as well, even though it is in theory a fine way to do things. But is that what you're getting at? Improve spark integration - Key: SPARK-1698 URL: https://issues.apache.org/jira/browse/SPARK-1698 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.0.0 Using the shade plugin to create a big JAR with all the dependencies can cause a few problems: 1. The jars' meta information goes missing 2. Some files get overwritten, e.g. plugin.xml 3. Different versions of the same jar may co-exist 4. Too big; Java 6 does not support it -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-1556) jets3t dep doesn't update properly with newer Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1556: - Comment: was deleted (was: Actually, why does Spark have a direct dependency on jets3t at all? it is not used directly in the code. If it's only needed at runtime, it can/should be declared that way. But if the reason it's there is just for Hadoop, then of course hadoop-client is already bringing it in, and should be allowed to bring in the version it wants.) jets3t dep doesn't update properly with newer Hadoop versions - Key: SPARK-1556 URL: https://issues.apache.org/jira/browse/SPARK-1556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Reporter: Nan Zhu Assignee: Nan Zhu Priority: Blocker Fix For: 1.0.0 In Hadoop 2.2.x or newer, Jet3st 0.9.0 which defines S3ServiceException/ServiceException is introduced, however, Spark still relies on Jet3st 0.7.x which has no definition of these classes What I met is that [code] 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.SparkContext.runJob(SparkContext.scala:891) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692) at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at
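A one-line sketch of the (since-deleted) suggestion above, if jets3t really is only needed at runtime; the coordinates and version follow the issue text:
{code}
// Declare jets3t at runtime scope rather than compile scope: Spark's own code
// never references it directly, and Hadoop only needs it at runtime for the
// s3/s3n filesystems.
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.0" % "runtime"
{code}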
[jira] [Commented] (SPARK-1520) Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13989098#comment-13989098 ] Sean Owen commented on SPARK-1520: -- [~pwendell] On this note, I wonder if it's also best to make Jenkins build with Java 6? I'm not quite sure if it catches things like this, but it catches things with similar roots. I had a request open at https://issues.apache.org/jira/browse/SPARK-1437 but it's a Jenkins change rather than a code change. Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6 - Key: SPARK-1520 URL: https://issues.apache.org/jira/browse/SPARK-1520 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Assignee: Xiangrui Meng Priority: Blocker Fix For: 1.0.0 This is a real doozie - when compiling a Spark assembly with JDK7, the produced jar does not work well with JRE6. I confirmed the byte code being produced is JDK 6 compatible (major version 50). What happens is that, silently, the JRE will not load any class files from the assembled jar. {code} $ sbt/sbt assembly/assembly $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] [FIFO|FAIR] $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/ui/UIWorkloadGenerator Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.UIWorkloadGenerator at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:323) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:268) Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. Program will exit. {code} I also noticed that if the jar is unzipped, and the classpath set to the current directory, it just works. Finally, if the assembly jar is compiled with JDK6, it also works. The error is seen with any class, not just the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only in master. h1. Isolation and Cause The package-time behavior of Java 6 and 7 differs with respect to the format used for jar files: ||Number of entries||JDK 6||JDK 7|| |<= 65536|zip|zip| |> 65536|zip*|zip64| zip* is a workaround for the original zip format, [described in JDK-6828461|https://bugs.openjdk.java.net/browse/JDK-4828461], that allows some versions of Java 6 to support larger assembly jars. The Scala libraries we depend on have added a large number of classes, which bumped us over the limit. This causes the Java 7 packaging to not work with Java 6. We can probably go back under the limit by clearing out some accidental inclusion of FastUtil, but eventually we'll go over again. The real answer is to force people to build with JDK 6 if they want to run Spark on JRE 6. 
-I've found that if I just unpack and re-pack the jar (using `jar`) it always works:- {code} $ cd assembly/target/scala-2.10/ $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar * $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds {code} -I also noticed something of note. The Breeze package contains single directories that have huge numbers of files in them (e.g. 2000+ class files in one directory). It's possible we are hitting some weird bugs/corner cases with compatibility of the internal storage format of the jar itself.- -I narrowed this down specifically to the inclusion of the breeze library. Just adding breeze to an older (unaffected) build triggered the issue.- -I ran a git bisection and this appeared after the MLLib sparse vector patch was merged:- https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534 SPARK-1212 -- This message was sent by Atlassian JIRA (v6.2#6252)
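A quick way to see which side of the limit an assembly falls on is to count its entries. A small Scala sketch (run it under Java 7, which can read both the classic zip and zip64 formats):
{code}
import java.util.zip.ZipFile
import scala.collection.JavaConverters._

// Counts the entries in a jar; per the issue above, a jar with more than
// 65536 files gets written in zip64 form by JDK 7 and cannot be read by JRE 6.
object AssemblyEntryCount {
  def main(args: Array[String]): Unit = {
    val jar = new ZipFile(args(0))
    try {
      val n = jar.entries().asScala.size
      val note = if (n > 65536) " (zip64 territory -- not readable on JRE 6)" else ""
      println(s"${args(0)}: $n entries$note")
    } finally {
      jar.close()
    }
  }
}
{code}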
[jira] [Created] (SPARK-1727) Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs
Sean Owen created SPARK-1727: Summary: Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs Key: SPARK-1727 URL: https://issues.apache.org/jira/browse/SPARK-1727 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Minor While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994990#comment-13994990 ] Sean Owen commented on SPARK-1802: -- [~pwendell] You can see my start on it here: https://github.com/srowen/spark/commits/SPARK-1802 https://github.com/srowen/spark/commit/a856604cfc67cb58146ada01fda6dbbb2515fa00 This resolves the new issues you note in your diff. Next issue is that hive-exec, quite awfully, includes a copy of all of its transitive dependencies in its artifact. See https://issues.apache.org/jira/browse/HIVE-5733 and note the warnings you'll get during assembly: {code} [WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: [WARNING] - org.apache.thrift.transport.TSaslTransport$SaslResponse ... {code} hive-exec is in fact used in this module. Aside from actual surgery on the artifact with the shade plugin, you can't control the dependencies as a result. This may be simply the best that can be done right now. If it has worked, it has worked. Am I right that the datanucleus JARs *are* meant to be in the assembly, only for the Hive build? https://github.com/apache/spark/pull/688 https://github.com/apache/spark/pull/610 That's good if so since that's what your diff shows. Finally, while we're here, I note that there are still a few JAR conflicts that turn up when you build the assembly *without* Hive. (I'm going to ignore conflicts in examples; these can be cleaned up but aren't really a big deal given its nature.) We could touch those up too. This is in the normal build (and I know how to zap most of this problem): {code} [WARNING] commons-beanutils-core-1.8.0.jar, commons-beanutils-1.7.0.jar define 82 overlappping classes: {code} These turn up in the Hadoop 2.x + YARN build: {code} [WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 overlappping classes: ... [WARNING] jcl-over-slf4j-1.7.5.jar, commons-logging-1.1.3.jar define 6 overlappping classes: ... [WARNING] activation-1.1.jar, javax.activation-1.1.0.v201105071233.jar define 17 overlappping classes: ... [WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 overlappping classes: {code} These should be easy to track down. Shall I? Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. 
{code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1802: - Attachment: hive-exec-jar-problems.txt Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 Attachments: hive-exec-jar-problems.txt I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995815#comment-13995815 ] Sean Owen commented on SPARK-1802: -- I looked further into just what might go wrong by including hive-exec into the assembly, since it includes its dependencies directly (i.e. Maven can't manage around it.) Attached is a full dump of the conflicts. The ones that are potential issues appear to be the following, and one looks like it could be a deal-breaker -- protobuf -- since it's neither forwards nor backwards compatible. That is, I recommend testing this assembly with an older Hadoop that needs 2.4.1 and see if it croaks. The rest might be worked around but need some additional mojo to make sure the right version wins in the packaging. Certainly having hive-exec in the build is making me queasy! [WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: HBase includes libthrift-0.8.0, but it's in examples, and so figure this is ignorable. [WARNING] hive-exec-0.12.0.jar, commons-lang-2.4.jar define 2 overlappping classes: Probably ignorable, but we have to make sure commons-lang-3.3.2 'wins' in the build. [WARNING] hive-exec-0.12.0.jar, jackson-core-asl-1.9.11.jar define 117 overlappping classes: [WARNING] hive-exec-0.12.0.jar, jackson-mapper-asl-1.8.8.jar define 432 overlappping classes: Believe this are ignorable. (Not sure why the jackson versions are mismatched? another todo) [WARNING] hive-exec-0.12.0.jar, guava-14.0.1.jar define 1087 overlappping classes: Should be OK. Hive uses 11.0.2 like Hadoop; the build is already taking that particular risk. We need 14.0.1 to win. [WARNING] hive-exec-0.12.0.jar, protobuf-java-2.4.1.jar define 204 overlappping classes: Oof. Hive has protobuf 2.5.0. This has got to be a problem for older Hadoop builds? Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. 
{code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1760) mvn -Dsuites=* test throw an ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993493#comment-13993493 ] Sean Owen commented on SPARK-1760: -- If `wildcardSuites` lets you invoke specific suites across the whole project, then that sounds like an ideal solution. If it works then I'd propose that as a small doc change? mvn -Dsuites=* test throw an ClassNotFoundException -- Key: SPARK-1760 URL: https://issues.apache.org/jira/browse/SPARK-1760 Project: Spark Issue Type: Bug Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.0.0 {{mvn -Dhadoop.version=0.23.9 -Phadoop-0.23 -Dsuites=org.apache.spark.repl.ReplSuite test}} = {code} *** RUN ABORTED *** java.lang.ClassNotFoundException: org.apache.spark.repl.ReplSuite at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1470) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1469) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.immutable.List.foreach(List.scala:318) ... {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995815#comment-13995815 ] Sean Owen edited comment on SPARK-1802 at 5/13/14 11:27 AM: (Edited to fix comment about protobuf versions) I looked further into just what might go wrong by including hive-exec into the assembly, since it includes its dependencies directly (i.e. Maven can't manage around it.) Attached is a full dump of the conflicts. The ones that are potential issues appear to be the following, and one looks like it could be a deal-breaker -- protobuf -- since it's neither forwards nor backwards compatible. That is, I recommend testing this assembly with an *newer* Hadoop that needs 2.5 and see if it croaks. The rest might be worked around but need some additional mojo to make sure the right version wins in the packaging. Certainly having hive-exec in the build is making me queasy! [WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: HBase includes libthrift-0.8.0, but it's in examples, and so figure this is ignorable. [WARNING] hive-exec-0.12.0.jar, commons-lang-2.4.jar define 2 overlappping classes: Probably ignorable, but we have to make sure commons-lang-3.3.2 'wins' in the build. [WARNING] hive-exec-0.12.0.jar, jackson-core-asl-1.9.11.jar define 117 overlappping classes: [WARNING] hive-exec-0.12.0.jar, jackson-mapper-asl-1.8.8.jar define 432 overlappping classes: Believe this are ignorable. (Not sure why the jackson versions are mismatched? another todo) [WARNING] hive-exec-0.12.0.jar, guava-14.0.1.jar define 1087 overlappping classes: Should be OK. Hive uses 11.0.2 like Hadoop; the build is already taking that particular risk. We need 14.0.1 to win. [WARNING] hive-exec-0.12.0.jar, protobuf-java-2.4.1.jar define 204 overlappping classes: Oof. Hive has protobuf *2.4.1*. This has got to be a problem for newer Hadoop builds? (Edited to fix comment about protobuf versions) was (Author: srowen): I looked further into just what might go wrong by including hive-exec into the assembly, since it includes its dependencies directly (i.e. Maven can't manage around it.) Attached is a full dump of the conflicts. The ones that are potential issues appear to be the following, and one looks like it could be a deal-breaker -- protobuf -- since it's neither forwards nor backwards compatible. That is, I recommend testing this assembly with an older Hadoop that needs 2.4.1 and see if it croaks. The rest might be worked around but need some additional mojo to make sure the right version wins in the packaging. Certainly having hive-exec in the build is making me queasy! [WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: HBase includes libthrift-0.8.0, but it's in examples, and so figure this is ignorable. [WARNING] hive-exec-0.12.0.jar, commons-lang-2.4.jar define 2 overlappping classes: Probably ignorable, but we have to make sure commons-lang-3.3.2 'wins' in the build. [WARNING] hive-exec-0.12.0.jar, jackson-core-asl-1.9.11.jar define 117 overlappping classes: [WARNING] hive-exec-0.12.0.jar, jackson-mapper-asl-1.8.8.jar define 432 overlappping classes: Believe this are ignorable. (Not sure why the jackson versions are mismatched? another todo) [WARNING] hive-exec-0.12.0.jar, guava-14.0.1.jar define 1087 overlappping classes: Should be OK. Hive uses 11.0.2 like Hadoop; the build is already taking that particular risk. We need 14.0.1 to win. 
[WARNING] hive-exec-0.12.0.jar, protobuf-java-2.4.1.jar define 204 overlappping classes: Oof. Hive has protobuf 2.5.0. This has got to be a problem for older Hadoop builds? Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 Attachments: hive-exec-jar-problems.txt I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar
[jira] [Commented] (SPARK-1789) Multiple versions of Netty dependencies cause FlumeStreamSuite failure
[ https://issues.apache.org/jira/browse/SPARK-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997317#comment-13997317 ] Sean Owen commented on SPARK-1789: -- I don't have any info either way on that. Later is always better no? probably OK to consider post-1.0? The issue here was to do with Netty, and the comment about Akka that I quoted was really meant to suggest that it was Netty (as it happens being imported by Akka) that was relevant. Multiple versions of Netty dependencies cause FlumeStreamSuite failure -- Key: SPARK-1789 URL: https://issues.apache.org/jira/browse/SPARK-1789 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Sean Owen Assignee: Sean Owen Labels: flume, netty, test Fix For: 1.0.0 TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure. I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?) velvia notes: I have found a workaround. If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty. There are at least 3 versions of Netty in play in the build: - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6. - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue. The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final. But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile. If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation. So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict: - Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts - Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty - Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent - Update SBT build accordingly A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible. -- This message was sent by Atlassian JIRA (v6.2#6252)
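A rough SBT sketch of the exclusion pattern described above; the Flume artifact name here is illustrative, and the real change touches the Maven poms and SBT build together:
{code}
// Sketch only: Flume 1.4.0 transitively brings in io.netty:netty:3.4.0.Final,
// which is the immediate problem, so exclude it there; Spark Core stays on
// io.netty:netty-all 4.0.17.Final, and the custom Akka keeps its own
// io.netty:netty 3.6.6.Final untouched.
libraryDependencies ++= Seq(
  "io.netty" % "netty-all" % "4.0.17.Final",
  "org.apache.flume" % "flume-ng-sdk" % "1.4.0" exclude("io.netty", "netty")
)
{code}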
[jira] [Commented] (SPARK-1827) LICENSE and NOTICE files need a refresh to contain transitive dependency info
[ https://issues.apache.org/jira/browse/SPARK-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997458#comment-13997458 ] Sean Owen commented on SPARK-1827: -- LICENSE and NOTICE policy is explained here: http://www.apache.org/dev/licensing-howto.html http://www.apache.org/legal/3party.html This leads to the following changes. First, this change enables two extensions to maven-shade-plugin in assembly/ that will try to include and merge all NOTICE and LICENSE files. This can't hurt. This generates a consolidated NOTICE file that I manually added to NOTICE. Next, a list of all dependencies and their licenses was generated: mvn ... license:aggregate-add-third-party to create: target/generated-sources/license/THIRD-PARTY.txt Each dependency is listed with one or more licenses. Determine the most-compatible license for each if there is more than one. For unknown license dependencies, I manually evaluated their license. Many are actually Apache projects or components of projects covered already. The only non-trivial one was Colt, which has its own (compatible) license. I ignored Apache-licensed and public domain dependencies as these require no further action (beyond NOTICE above). BSD and MIT licenses (permissive Category A licenses) are evidently supposed to be mentioned in LICENSE, so I added a section with the output from the THIRD-PARTY.txt file appropriately. Everything else, the Category B licenses, is evidently mentioned in NOTICE (?). Same there. LICENSE contained some license statements for source code that is redistributed. I left this as I think that is the right place to put it. LICENSE and NOTICE files need a refresh to contain transitive dependency info - Key: SPARK-1827 URL: https://issues.apache.org/jira/browse/SPARK-1827 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Blocker Fix For: 1.0.0 (Pardon marking it a blocker, but think it needs doing before 1.0 per chat with [~pwendell]) The LICENSE and NOTICE files need to cover all transitive dependencies, since these are all distributed in the assembly jar. (c.f. http://www.apache.org/dev/licensing-howto.html ) I don't believe the current files cover everything. It's possible to mostly-automatically generate these. I will generate this and propose a patch to both today. -- This message was sent by Atlassian JIRA (v6.2#6252)
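Not part of the change itself, but for sorting the generated report into the Category A and Category B buckets described above, something like this Scala sketch works, assuming the license plugin's usual one-line-per-dependency format with the license name in leading parentheses (the format is an assumption, not verified here):
{code}
import scala.io.Source

// Rough sketch: group THIRD-PARTY.txt entries by license name, so permissive
// (Category A) and Category B licenses can be split between LICENSE and NOTICE.
// Assumes lines look like "  (License Name) Artifact (group:artifact:version - url)".
object GroupThirdPartyByLicense {
  private val Line = """^\s*\((.+?)\)\s+(.*)$""".r

  def main(args: Array[String]): Unit = {
    val entries = Source.fromFile(args(0)).getLines()
      .collect { case Line(license, artifact) => license -> artifact }
      .toList
    entries.groupBy(_._1).toSeq.sortBy(_._1).foreach { case (license, deps) =>
      println(s"$license (${deps.size}):")
      deps.foreach { case (_, artifact) => println(s"  $artifact") }
    }
  }
}
{code}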
[jira] [Created] (SPARK-1827) LICENSE and NOTICE files need a refresh to contain transitive dependency info
Sean Owen created SPARK-1827: Summary: LICENSE and NOTICE files need a refresh to contain transitive dependency info Key: SPARK-1827 URL: https://issues.apache.org/jira/browse/SPARK-1827 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Blocker (Pardon marking it a blocker, but think it needs doing before 1.0 per chat with [~pwendell]) The LICENSE and NOTICE files need to cover all transitive dependencies, since these are all distributed in the assembly jar. (c.f. http://www.apache.org/dev/licensing-howto.html ) I don't believe the current files cover everything. It's possible to mostly-automatically generate these. I will generate this and propose a patch to both today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1760) mvn -Dsuites=* test throw an ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992837#comment-13992837 ] Sean Owen commented on SPARK-1760: -- Yeah, I think you would have to run this from the `repl/` module for it to work. At least it does for me, and that makes sense. I think the docs just need to note that, at: https://spark.apache.org/docs/0.9.1/building-with-maven.html (and that suite name can be updated to include org.apache) mvn -Dsuites=* test throw an ClassNotFoundException -- Key: SPARK-1760 URL: https://issues.apache.org/jira/browse/SPARK-1760 Project: Spark Issue Type: Bug Reporter: Guoqiang Li {{mvn -Dhadoop.version=0.23.9 -Phadoop-0.23 -Dsuites=org.apache.spark.repl.ReplSuite test}} = {code} *** RUN ABORTED *** java.lang.ClassNotFoundException: org.apache.spark.repl.ReplSuite at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1470) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1469) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.immutable.List.foreach(List.scala:318) ... {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1575) failing tests with master branch
[ https://issues.apache.org/jira/browse/SPARK-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996955#comment-13996955 ] Sean Owen commented on SPARK-1575: -- For what it's worth, I no longer see this failure; I believe it has been resolved by other changes along the way. failing tests with master branch - Key: SPARK-1575 URL: https://issues.apache.org/jira/browse/SPARK-1575 Project: Spark Issue Type: Test Reporter: Nishkam Ravi Priority: Blocker Built the master branch against Hadoop version 2.3.0-cdh5.0.0 with SPARK_YARN=true. sbt tests don't go through successfully (tried multiple runs). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1787) Build failure on JDK8 :: SBT fails to load build configuration file
[ https://issues.apache.org/jira/browse/SPARK-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998492#comment-13998492 ] Sean Owen commented on SPARK-1787: -- Duplicate of https://issues.apache.org/jira/browse/SPARK-1444 it appears Build failure on JDK8 :: SBT fails to load build configuration file --- Key: SPARK-1787 URL: https://issues.apache.org/jira/browse/SPARK-1787 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 0.9.0 Environment: JDK8 Scala 2.10.X SBT 0.12.X Reporter: Richard Gomes Priority: Minor SBT fails to build under JDK8. Please find steps to reproduce the error below: (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ uname -a Linux terra 3.13-1-amd64 #1 SMP Debian 3.13.10-1 (2014-04-15) x86_64 GNU/Linux (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ java -version java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ scala -version Scala code runner version 2.10.3 -- Copyright 2002-2013, LAMP/EPFL (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ sbt/sbt clean Launching sbt from sbt/sbt-launch-0.12.4.jar Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=350m; support was removed in 8.0 [info] Loading project definition from /home/rgomes/workspace/spark-0.9.1/project/project [info] Compiling 1 Scala source to /home/rgomes/workspace/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes... [error] error while loading CharSequence, class file '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken [error] (bad constant pool tag 15 at byte 1501) [error] error while loading Comparator, class file '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/util/Comparator.class)' is broken [error] (bad constant pool tag 15 at byte 5003) [error] two errors found [error] (compile:compile) Compilation failed Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998494#comment-13998494 ] Sean Owen commented on SPARK-1473: -- I believe these types of things were more the goals of the MLI and MLbase projects than of MLlib? I don't know the status of those. For what it's worth, I think these are very useful things, but in a separate 'layer' above something like MLlib. Feature selection for high dimensional datasets --- Key: SPARK-1473 URL: https://issues.apache.org/jira/browse/SPARK-1473 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ignacio Zendejas Priority: Minor Labels: features Fix For: 1.1.0 For classification tasks involving large feature spaces in the order of tens of thousands or higher (e.g., text classification with n-grams, where n > 1), it is often useful to rank and filter features that are irrelevant, thereby reducing the feature space by at least one or two orders of magnitude without impacting performance on key evaluation metrics (accuracy/precision/recall). A flexible feature evaluation interface needs to be designed, and at least two methods should be implemented, with Information Gain being a priority as it has been shown to be amongst the most reliable. Special consideration should be taken in the design to account for wrapper methods (see research papers below) which are more practical for lower dimensional data. Relevant research: * Brown, G., Pocock, A., Zhao, M. J., Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. *The Journal of Machine Learning Research*, *13*, 27-66. * Forman, George. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3 (2003): 1289-1305. -- This message was sent by Atlassian JIRA (v6.2#6252)
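To make the Information Gain criterion mentioned in the description concrete, here is a small self-contained Scala sketch (purely illustrative, not an MLlib API) that scores a binary present/absent feature from per-class counts:
{code}
// Information gain of a binary feature: H(class) minus the expected entropy
// of the class once the feature's presence/absence is known. Assumes the
// count vectors are non-empty and aligned by class.
object InfoGain {
  private def entropy(counts: Seq[Long]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  def informationGain(withFeature: Seq[Long], withoutFeature: Seq[Long]): Double = {
    val nWith = withFeature.sum.toDouble
    val nWithout = withoutFeature.sum.toDouble
    val n = nWith + nWithout
    val classCounts = withFeature.zip(withoutFeature).map { case (a, b) => a + b }
    entropy(classCounts) -
      (nWith / n) * entropy(withFeature) -
      (nWithout / n) * entropy(withoutFeature)
  }
}
{code}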
[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001424#comment-14001424 ] Sean Owen commented on SPARK-1875: -- (That's correct that commons-lang and commons-lang3 use separate packages.) NoClassDefFoundError: StringUtils when building against Hadoop 1 Key: SPARK-1875 URL: https://issues.apache.org/jira/browse/SPARK-1875 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Assignee: Guoqiang Li Priority: Blocker Fix For: 1.0.0 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 and Hive enabled, if I go into it and run spark-shell, I get this: {code} java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils at org.apache.hadoop.metrics2.lib.MetricMutableStat.init(MetricMutableStat.java:59) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:75) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:120) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:37) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.clinit(DefaultMetricsSystem.java:34) at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184) at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236) at org.apache.hadoop.security.KerberosName.clinit(KerberosName.java:79) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226) at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) at org.apache.spark.SparkContext.init(SparkContext.scala:228) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
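To illustrate the parenthetical above: the two artifacts use different Java packages, so both can sit on a classpath without clashing. A tiny Scala sketch, assuming both jars are present:
{code}
import org.apache.commons.lang.StringUtils                      // commons-lang 2.x, what Hadoop 1 expects
import org.apache.commons.lang3.{StringUtils => StringUtils3}   // commons-lang3, a separate package

object LangVersions {
  def main(args: Array[String]): Unit = {
    println(StringUtils.isBlank("  "))   // 2.x API
    println(StringUtils3.isBlank("  "))  // 3.x API
  }
}
{code}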
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007339#comment-14007339 ] Sean Owen commented on SPARK-1867: -- Yes, the 'mr1' artifacts are for when you are *not* using YARN. These are unusual to use for CDH5, and you would not need those versions. The stock Spark artifacts are for Hadoop 1, not Hadoop 2. They can be built for Hadoop 2 and installed locally if you like. You can use the matched CDH5 version, which is of course made for Hadoop 2, by targeting '0.9.0-cdh5.0.1' for example. (I don't have a release schedule, but assume some later version will be released with 5.1 of course.) There shouldn't be any trial and error to it if you express as dependencies all the things you directly use. For example, you say your app uses HBase, but I see no dependence on the client libraries. This has nothing to do with Spark per se. If it does turn into trial and error, you're probably trying to do something the wrong way around. Which classes are you actually looking for? Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new
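Following up on the comment above about targeting the CDH5-matched Spark artifact and declaring the HBase client libraries the app uses directly, here is a rough sbt sketch of that shape of build. The repository URL and the exact HBase artifact version are assumptions for illustration, not verified coordinates.
{code}
// build.sbt sketch: a Spark core built against Hadoop 2 / CDH 5, plus the HBase
// client API the application actually calls; other Hadoop bits arrive transitively.
resolvers += "cloudera-repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "0.9.0-cdh5.0.1",
  "org.apache.hbase" % "hbase-client"    % "0.96.1.1-cdh5.0.1"  // assumed version string
)
{code}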
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008077#comment-14008077 ] Sean Owen commented on SPARK-1867: -- I could be way wrong here, partly as a function of not knowing the context entirely, but if things are being fixed by including bits and pieces of CDH-flavored M/R dependency then you may just be patching over the fact that you really need to use the Hadoop2/CDH-flavored Spark artifacts to begin with. If anyone thinks it's useful to discuss how this is done normally, offline, I'd be happy to say what I know, given more info. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008329#comment-14008329 ] Sean Owen commented on SPARK-1867: -- Well, for example, MapReduce classes are from one of the Hadoop hadoop-mapreduce-* artifacts. Generally speaking, apps depend on client artifacts, which then bring in other necessary dependencies for you. You don't manually add dependencies for your dependencies; that's what Maven is doing for you. (Although we all know some artifacts don't express their dependencies 100% correctly all the time...) That said, the upstream projects do change over time. For example, in Hadoop 1.x, there was a hadoop-core artifact. In Hadoop 2.x things were broken out further, which is generally good, but now there is hadoop-common. But you don't depend on that to use Hadoop; you would use hadoop-mapreduce-client for example if using MapReduce. The same is true of HBase, I imagine. CDH and other distributions do not move things around -- upstream projects do. Simple use cases are simple to get working. For example, if you're using Spark's core, you just depend on the one spark-core artifact. (With the caveat that you need to depend on a different Hadoop 2-compatible artifact if you use Hadoop 2 -- still one artifact, but maybe 0.9.0-cdh5.0.1 for example.) If you use HBase, you depend on whatever the HBase client artifact is too. There are gotchas here to be sure, and bits of project-specific knowledge that are required, but it's not nearly random guesswork. Maybe you can separately say what you are trying to depend on and people can confirm the few direct dependencies you should declare. I suspect there are some fundamental problems, like depending on the wrong Spark artifact. It sounds like you are manually trying to replace all of Spark's Hadoop 1 dependencies with Hadoop 2 dependencies, and that way lies madness. Use the one build for Hadoop 2. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at
[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009472#comment-14009472 ] Sean Owen commented on SPARK-1935: -- Yeah, I think this is a matter of Maven's nearest-first vs SBT's latest-first conflict resolution strategy. It should be safe to manually manage this to 1.5, I believe. Explicitly add commons-codec 1.4 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Priority: Minor Right now, commons-codec is a transitive dependency. When Spark is built by maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an older version (Hadoop 1.0.4 depends on 1.4). This older version can cause problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
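A minimal sbt sketch of the manual pin suggested above, assuming an sbt 0.13-era build file; the only point is to hold commons-codec at one version so that Maven's nearest-wins and SBT's latest-wins resolutions stop disagreeing.
{code}
// build.sbt: override whatever commons-codec version transitive dependencies
// (jets3t, Hadoop) would otherwise pull in, pinning it everywhere.
dependencyOverrides += "commons-codec" % "commons-codec" % "1.5"
{code}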
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010294#comment-14010294 ] Sean Owen commented on SPARK-1518: -- 0.20.x stopped in early 2010. It is ancient. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010322#comment-14010322 ] Sean Owen commented on SPARK-1518: -- RE: Hadoop versions, in my reckoning of the twisted world of Hadoop versions, the 0.23.x branch is still active and so is kind of later than 1.0.x. It may be easier to retain 0.23 compatibility than 1.0.x for example. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010937#comment-14010937 ] Sean Owen commented on SPARK-1518: -- Re: versioning one more time, really supporting a bunch of versions may get costly. It's already tricky to manage two builds, times YARN-or-not, times Hive-or-not, times four flavors of Hadoop. I doubt the assemblies are yet problem-free in all cases. In practice it looks like one generic Hadoop 1, Hadoop 2, and CDH 4 release is produced, and one set of Maven artifacts. (PS: again, I am not sure Spark should contain a CDH-specific distribution, realizing it's really a proxy for a particular Hadoop combo. Same goes for a MapR profile, which is really for vendors to maintain.) That means right now you can't build a Spark app for anything but Hadoop 1.x with Maven, without installing it yourself, and there's not an official distro for anything but two major Hadoop versions. Support for niche versions isn't really there or promised anyway, and fleshing it out may become pretty burdensome. There is no suggested action here; if anything, I suggest that the right thing is to add Maven artifacts with classifiers, add a few binary artifacts, and subtract a few vendor artifacts, but this is a different action. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1950) spark on yarn can't start
[ https://issues.apache.org/jira/browse/SPARK-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011330#comment-14011330 ] Sean Owen commented on SPARK-1950: -- (Looks like you opened this twice? https://issues.apache.org/jira/browse/SPARK-1951 ) spark on yarn can't start -- Key: SPARK-1950 URL: https://issues.apache.org/jira/browse/SPARK-1950 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Guoqiang Li Priority: Blocker {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives /input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}}throw an exception: {code} Exception in thread main java.io.FileNotFoundException: File file:/input/lbs/recommend/toona/spark/conf does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289) at org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39) at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74) at org.apache.spark.deploy.yarn.Client.run(Client.scala:96) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} work. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011897#comment-14011897 ] Sean Owen commented on SPARK-1518: -- {quote}they write their app against the Spark APIs in Maven central (they can do this no matter which cluster they want to run on){quote} Yeah, this is the issue. OK, if I compile against Spark artifacts as a runtime dependency and submit an app to the cluster, it should be OK no matter what build of Spark is running. The binding from Spark to Hadoop is hidden from the app. I am thinking of the case where I want to build an app that is a client of Spark -- embedding it. Then I am including the client of Hadoop, for example. I have to match my cluster then, and there is no Hadoop 2 Spark artifact. Am I missing something big here? That's my premise about why there would ever be a need for different artifacts. It's the same use case as in Sandy's blog: http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/ Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012309#comment-14012309 ] Sean Owen commented on SPARK-1867: -- There is no hadoop-io module. Modules are subcomponents of the components distributed in something like CDH, and are not versioned independently, so you would not find them described on that page. That is, if CDH X.Y includes Hadoop Z.W then it includes version Z.W of all Hadoop's modules. LongWritable was in hadoop-core in Hadoop 1.x, and is in hadoop-common in 2.x. Just about everything depends on these modules, so you should not find yourself missing them at runtime if you are depending on something like hadoop-client. There isn't an org.apache.hadoop.io.CombineTextInputFormat, not that I can see -- where do you see that referenced? There is org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat, although it looks like it appeared around Hadoop 2.1 (https://issues.apache.org/jira/browse/MAPREDUCE-5069). You would not be able to use it if you are running on Hadoop 1, and wouldn't find it if you depend on Hadoop 1 modules. It appears to be in the hadoop-mapreduce-client-core module, but again, you wouldn't need to depend on that directly. That gets pulled in from hadoop-client, via hadoop-mapreduce-client. I can see it doing so when you build Spark for Hadoop 2, for example. I'm still not clear why you need to hunt for individual classes. Maybe one of those is just due to a package typo. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012434#comment-14012434 ] Sean Owen commented on SPARK-1867: -- Something else is up; those are equivalent in Java and you couldn't have imported two symbols with the same name. I could look at the file offline if you think I might spot something else. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due 
to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012552#comment-14012552 ] Sean Owen commented on SPARK-1518: -- Heh, I think the essence is: at least one more separate Maven artifact, under a different classifier, for Hadoop 2.x builds. If you package that, you get Spark and everything it needs to work against a Hadoop 2 cluster. Yeah I see that you're suggesting various ways to push the app to the cluster, where it can bind to the right version of things, and that may be the right-est way to think about this. I had envisioned running a stand-alone app on a machine that is not part of the cluster, that is a client of it, and this means packaging in the right Hadoop client dependencies, and Spark already declares how it wants to include these various Hadoop client versions -- it's more than just including hadoop-client -- so wanted to leverage that. Let's see if this actually turns out to be a broader request though. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
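To make the suggestion above concrete, the sbt line below is purely illustrative of what depending on such an artifact could look like; the hadoop2 classifier is hypothetical, and no such classified artifact was published at the time of this discussion.
{code}
// Hypothetical: a Hadoop 2 flavor of spark-core published under a classifier.
libraryDependencies +=
  "org.apache.spark" % "spark-core_2.10" % "1.0.0" classifier "hadoop2"
{code}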
[jira] [Created] (SPARK-1973) Add randomSplit to JavaRDD (with tests, and tidy Java tests)
Sean Owen created SPARK-1973: Summary: Add randomSplit to JavaRDD (with tests, and tidy Java tests) Key: SPARK-1973 URL: https://issues.apache.org/jira/browse/SPARK-1973 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sean Owen Priority: Minor I'd like to use randomSplit through the Java API, and would like to add a convenience wrapper for this method to JavaRDD. This is fairly trivial. (In fact, is the intent that JavaRDD not wrap every RDD method? and that sometimes users should just use JavaRDD.wrapRDD()?) Along the way, I added tests for it, and also touched up the Java API test style and behavior. This is maybe the more useful part of this small change. -- This message was sent by Atlassian JIRA (v6.2#6252)
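As a rough sketch of the convenience wrapper described above, under the assumption that it simply delegates to RDD.randomSplit and rewraps the pieces (illustrative only, not the committed implementation):
{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

// Enrichment exposing randomSplit with Java-friendly return types.
class JavaRDDRandomSplit[T: ClassTag](rdd: RDD[T]) {
  def randomSplit(weights: Array[Double], seed: Long): Array[JavaRDD[T]] =
    rdd.randomSplit(weights, seed).map(JavaRDD.fromRDD(_))
}
{code}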
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013591#comment-14013591 ] Sean Owen commented on SPARK-1518: -- Sorry for one more message here to reply to Matei -- yes, it's the laptop use case, except I'd describe that as a not-uncommon production deployment! It's the embedded-client scenario. It is more than adding one hadoop-client dependency, because you need to emulate the excludes, etc. that Spark has, too. (But yeah, then it works.) I agree supporting a bunch of Hadoop versions gets painful as a result. This was why I was suggesting way up top that supporting old versions may become more trouble than it's worth. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1974) Most examples fail at startup because spark.master is not set
Sean Owen created SPARK-1974: Summary: Most examples fail at startup because spark.master is not set Key: SPARK-1974 URL: https://issues.apache.org/jira/browse/SPARK-1974 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.0.0 Reporter: Sean Owen Most example code has a few lines like: {code} val sparkConf = new SparkConf().setAppName(Foo) val sc = new SparkContext(sparkConf) {code} The SparkContext constructor throws a SparkException if spark.master is not set though, so this fails immediately. What would be preferred -- let spark.master default to local\[2\]? or change all examples to call: {code} new SparkContext(local[2], Foo) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
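As an illustration of the first option above, here is a minimal sketch of what letting an example fall back to a local master could look like; this is just the idea, not the project's decided fix, and the app name and RDD are arbitrary.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object ExampleApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Foo")
    // Only default the master when none was supplied via spark-submit or system properties.
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local[2]")
    }
    val sc = new SparkContext(sparkConf)
    println(sc.parallelize(1 to 10).count())
    sc.stop()
  }
}
{code}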
[jira] [Commented] (SPARK-1998) SparkFlumeEvent with body bigger than 1020 bytes are not read properly
[ https://issues.apache.org/jira/browse/SPARK-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016327#comment-14016327 ] Sean Owen commented on SPARK-1998: -- (Can you make a pull request rather than a patch here? This looks important; I know we have some Flume streaming users.) SparkFlumeEvent with body bigger than 1020 bytes are not read properly -- Key: SPARK-1998 URL: https://issues.apache.org/jira/browse/SPARK-1998 Project: Spark Issue Type: Bug Reporter: sun.sam Attachments: patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1974) Most examples fail at startup because spark.master is not set
[ https://issues.apache.org/jira/browse/SPARK-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1974. -- Resolution: Not a Problem Fix Version/s: (was: 1.0.1) Decision was to not modify examples, but to possibly set a spark.master default. See also https://issues.apache.org/jira/browse/SPARK-1906 Most examples fail at startup because spark.master is not set - Key: SPARK-1974 URL: https://issues.apache.org/jira/browse/SPARK-1974 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.0.0 Reporter: Sean Owen Most example code has a few lines like: {code} val sparkConf = new SparkConf().setAppName(Foo) val sc = new SparkContext(sparkConf) {code} The SparkContext constructor throws a SparkException if spark.master is not set though, so this fails immediately. What would be preferred -- let spark.master default to local\[2\]? or change all examples to call: {code} new SparkContext(local[2], Foo) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue
[ https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017528#comment-14017528 ] Sean Owen commented on SPARK-2018: -- The meaning of the error is that Java thinks two serializable classes are not mutually compatible. This is because two different serialVersioUIDs get computed for two copies of what may be the same class. If I understand you correctly, you are communicating between different JVM versions, or reading one's output from the other? I don't think it's guaranteed that the auto-generated serialVersionUID will be the same. If so, it's nothing to do with big-endian-ness per se. Does it happen entirely within the same machine / JVM? Big-Endian (IBM Power7) Spark Serialization issue -- Key: SPARK-2018 URL: https://issues.apache.org/jira/browse/SPARK-2018 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Environment: hardware : IBM Power7 OS:Linux version 2.6.32-358.el6.ppc64 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5)) IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 20130617_152572 (JIT enabled, AOT enabled) Hadoop:Hadoop-0.2.3-CDH5.0 Spark:Spark-1.0.0 or Spark-0.9.1 spark-env.sh: export JAVA_HOME=/opt/ibm/java-ppc64-70/ export SPARK_MASTER_IP=9.114.34.69 export SPARK_WORKER_MEMORY=1m export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib export STANDALONE_SPARK_MASTER_HOST=9.114.34.69 #export SPARK_JAVA_OPTS=' -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n ' Reporter: Yanjie Gao We have an application run on Spark on Power7 System . But we meet an important issue about serialization. The example HdfsWordCount can meet the problem. ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir We used Power7 (Big-Endian arch) and Redhat 6.4. Big-Endian is the main cause since the example ran successfully in another Power-based Little Endian setup. here is the exception stack and log: Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/ -XX:MaxPermSize=128m -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker app-20140604023054- 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:22 INFO Remoting: Starting remoting 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:24 INFO Remoting: Starting remoting 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses
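Related to the serialVersionUID point in the SPARK-2018 comment above, a tiny illustrative sketch (an assumption about one way to sidestep auto-generated IDs, not a statement about the reporter's classes): declaring the ID explicitly lets classes compiled or loaded on different JVMs agree on serialization compatibility.
{code}
// With an explicit serialVersionUID, two JVMs that would compute different automatic
// IDs for otherwise-compatible versions of this class still treat it as compatible.
@SerialVersionUID(1L)
class Record(val key: String, val count: Long) extends Serializable
{code}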
[jira] [Commented] (SPARK-2026) Maven hadoop* Profiles Should Set the expected Hadoop Version.
[ https://issues.apache.org/jira/browse/SPARK-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018512#comment-14018512 ] Sean Owen commented on SPARK-2026: -- A few people have mentioned and asked for this, especially as it helps the build work cleanly in IntelliJ. FWIW I would like this change too. Do you have a PR? Maven hadoop* Profiles Should Set the expected Hadoop Version. Key: SPARK-2026 URL: https://issues.apache.org/jira/browse/SPARK-2026 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0 Reporter: Bernardo Gomez Palacio The Maven profiles that refer to _hadoopX_, e.g. hadoop-2.4, should set the expected _hadoop.version_. E.g., instead of:
{code}
<profile>
  <id>hadoop-2.4</id>
  <properties>
    <protobuf.version>2.5.0</protobuf.version>
    <jets3t.version>0.9.0</jets3t.version>
  </properties>
</profile>
{code}
it is suggested:
{code}
<profile>
  <id>hadoop-2.4</id>
  <properties>
    <hadoop.version>2.4.0</hadoop.version>
    <yarn.version>${hadoop.version}</yarn.version>
    <protobuf.version>2.5.0</protobuf.version>
    <jets3t.version>0.9.0</jets3t.version>
  </properties>
</profile>
{code}
Builds can still define the -Dhadoop.version option, but this will correctly default the Hadoop version to the one expected according to the profile that is selected. E.g.
{code}
$ mvn -P hadoop-2.4,yarn clean compile
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason
[ https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018760#comment-14018760 ] Sean Owen commented on SPARK-2019: -- I believe that's coming with 5.1 but I don't know when that is scheduled. We can talk about issues like this offline -- really your best bet is support anyway. Spark workers die/disappear when job fails for nearly any reason Key: SPARK-2019 URL: https://issues.apache.org/jira/browse/SPARK-2019 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: sam We either have to reboot all the nodes, or run 'sudo service spark-worker restart' across our cluster. I don't think this should happen - the job failures are often not even that bad. There is a 5 upvoted SO question here: http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails We shouldn't be giving restart privileges to our devs, and therefore our sysadm has to frequently restart the workers. When the sysadm is not around, there is nothing our devs can do. Many thanks -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2090) spark-shell input text entry not showing on REPL
[ https://issues.apache.org/jira/browse/SPARK-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026354#comment-14026354 ] Sean Owen commented on SPARK-2090: -- I'm assuming it's specific to your env or config as I don't see this behavior, and haven't in the past, and assume others aren't seeing it. Have you enabled the SecurityManager? if so what settings? are there other errors? It looks like SimpleReader doesn't echo input, so that's that, but the question is what caused Permission denied from the standard SparkJLineReader. Maybe you can temporarily modify that error log message to log the whole stack trace to see what it is? spark-shell input text entry not showing on REPL Key: SPARK-2090 URL: https://issues.apache.org/jira/browse/SPARK-2090 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.0 Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60) Reporter: Richard Conway Priority: Critical Labels: easyfix, patch Fix For: 1.0.0 Original Estimate: 4h Remaining Estimate: 4h spark-shell doesn't allow text to be displayed on input Failed to created SparkJLineReader: java.io.IOException: Permission denied Falling back to SimpleReader. The driver has 2 workers on 2 virtual machines and error free apart from the above line so I think it may have something to do with the introduction of the new SecurityManager. The upshot is that when you type nothing is displayed on the screen. For example, type test at the scala prompt and you won't see the input but the output will show. scala console:11: error: package test is not a value test ^ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2103) Java + Kafka + Spark Streaming NoSuchMethodError in java.lang.Object.init
Sean Owen created SPARK-2103: Summary: Java + Kafka + Spark Streaming NoSuchMethodError in java.lang.Object.init Key: SPARK-2103 URL: https://issues.apache.org/jira/browse/SPARK-2103 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: Sean Owen This has come up a few times, from user venki-kratos: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-in-KafkaReciever-td2209.html and I ran into it a few weeks ago: http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3ccamassdlzs6ihctxepusphryxxa-wp26zgbxx83sm6niro0q...@mail.gmail.com%3E and yesterday user mpieck: {quote} When I use the createStream method from the example class like this: KafkaUtils.createStream(jssc, zookeeper:port, test, topicMap); everything is working fine, but when I explicitely specify message decoder classes used in this method with another overloaded createStream method: KafkaUtils.createStream(jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, props, topicMap, StorageLevels.MEMORY_AND_DISK_2); the applications stops with an error: 14/06/10 22:28:06 ERROR kafka.KafkaReceiver: Error receiving data java.lang.NoSuchMethodException: java.lang.Object.init(kafka.utils.VerifiableProperties) at java.lang.Class.getConstructor0(Unknown Source) at java.lang.Class.getConstructor(Unknown Source) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:108) at org.apache.spark.streaming.dstream.NetworkReceiver.start(NetworkInputDStream.scala:126) {quote} Something is making it try to instantiate java.lang.Object as if it's a Decoder class. I suspect that the problem is to do with https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala#L148 {code} implicit val keyCmd: Manifest[U] = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[U]] implicit val valueCmd: Manifest[T] = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[T]] {code} ... where U and T are key/value Decoder types. I don't know enough Scala to fully understand this, but is it possible this causes the reflective call later to lose the type and try to instantiate Object? The AnyRef made me wonder. I am sorry to say I don't have a PR to suggest at this point. -- This message was sent by Atlassian JIRA (v6.2#6252)
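As a way to check the suspicion in SPARK-2103 above, the following standalone Scala snippet illustrates the mechanism being guessed at (an assumption about why the reflective lookup sees java.lang.Object, not a confirmed diagnosis of the Spark code): a Manifest[AnyRef] cast to another Manifest type still reports Object as its runtime class.
{code}
object ManifestErasureDemo {
  // Returns the runtime class recorded in the (possibly bogus) Manifest.
  def runtimeClassOf[T](implicit m: Manifest[T]): Class[_] = m.runtimeClass

  def main(args: Array[String]): Unit = {
    // The same trick as in KafkaUtils: pretend a Manifest[AnyRef] is a Manifest[String].
    val bogus = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[String]]
    println(runtimeClassOf(bogus))   // class java.lang.Object -- the String type is lost
    println(runtimeClassOf[String])  // class java.lang.String
  }
}
{code}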
[jira] [Commented] (SPARK-2100) Allow users to disable Jetty Spark UI in local mode
[ https://issues.apache.org/jira/browse/SPARK-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032803#comment-14032803 ] Sean Owen commented on SPARK-2100: -- Tomcat and Jetty classes don't overlap -- do you mean the Servlet API classes? that's a different known issue. Allow users to disable Jetty Spark UI in local mode --- Key: SPARK-2100 URL: https://issues.apache.org/jira/browse/SPARK-2100 Project: Spark Issue Type: Improvement Reporter: DB Tsai Since we want to use Spark hadoop APIs in local mode for design time to explore the first couple hundred lines of data in HDFS. Also, we want to use Spark in our tomcat application, so starting a jetty UI will make our tomcat unhappy. In those scenarios, Spark UI is not necessary, and wasting resource. As a result, for local mode, it's desirable that users are able to disable the spark UI. Couple places I found where the jetty will be started. In SparkEnv.scala 1) val broadcastManager = new BroadcastManager(isDriver, conf, securityManager) 2) val httpFileServer = new HttpFileServer(securityManager) httpFileServer.initialize() I don't know if broadcastManager is needed in local mode tho. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2100) Allow users to disable Jetty Spark UI in local mode
[ https://issues.apache.org/jira/browse/SPARK-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033055#comment-14033055 ] Sean Owen edited comment on SPARK-2100 at 6/16/14 9:55 PM: --- Yes, the Maven build has to do a little work to exclude copies of the Servlet 2.x API. Spark ends up including one copy of the Servlet 3.0 APIs, which should make everybody happy. But if your build brings back in something else, and it's bringing its own Servlet API, you may need to exclude it. (This dependency is super annoying because different containers have distributed the same classes in different artifacts.) Advert break: SPARK-1949 fixes this type of issue for Spark's own SBT-based build. Not exactly the issue here but related, and would be cool to get it committed. https://issues.apache.org/jira/browse/SPARK-1949 was (Author: srowen): Yes, the Maven build has to do a little work to exclude copies of the Servlet 2.x API. Spark ends up including one copy of the Servlet 3.0 APIs, which should everybody happing. But if your build brings back in something else, and it's bringing its own Servlet API, you may need to exclude it. (This dependency is super annoying because different containers have distributed the same classes in different artifacts.) Advert break: SPARK-1949 fixes this type of issue for Spark's own SBT-based build. Not exactly the issue here but related, and would be cool to get it committed. https://issues.apache.org/jira/browse/SPARK-1949 Allow users to disable Jetty Spark UI in local mode --- Key: SPARK-2100 URL: https://issues.apache.org/jira/browse/SPARK-2100 Project: Spark Issue Type: Improvement Reporter: DB Tsai Since we want to use Spark hadoop APIs in local mode for design time to explore the first couple hundred lines of data in HDFS. Also, we want to use Spark in our tomcat application, so starting a jetty UI will make our tomcat unhappy. In those scenarios, Spark UI is not necessary, and wasting resource. As a result, for local mode, it's desirable that users are able to disable the spark UI. Couple places I found where the jetty will be started. In SparkEnv.scala 1) val broadcastManager = new BroadcastManager(isDriver, conf, securityManager) 2) val httpFileServer = new HttpFileServer(securityManager) httpFileServer.initialize() I don't know if broadcastManager is needed in local mode tho. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2100) Allow users to disable Jetty Spark UI in local mode
[ https://issues.apache.org/jira/browse/SPARK-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033055#comment-14033055 ] Sean Owen commented on SPARK-2100: -- Yes, the Maven build has to do a little work to exclude copies of the Servlet 2.x API. Spark ends up including one copy of the Servlet 3.0 APIs, which should make everybody happy. But if your build brings back in something else, and it's bringing its own Servlet API, you may need to exclude it. (This dependency is super annoying because different containers have distributed the same classes in different artifacts.) Advert break: SPARK-1949 fixes this type of issue for Spark's own SBT-based build. Not exactly the issue here but related, and would be cool to get it committed. https://issues.apache.org/jira/browse/SPARK-1949 Allow users to disable Jetty Spark UI in local mode --- Key: SPARK-2100 URL: https://issues.apache.org/jira/browse/SPARK-2100 Project: Spark Issue Type: Improvement Reporter: DB Tsai Since we want to use Spark hadoop APIs in local mode for design time to explore the first couple hundred lines of data in HDFS. Also, we want to use Spark in our tomcat application, so starting a jetty UI will make our tomcat unhappy. In those scenarios, Spark UI is not necessary, and wasting resource. As a result, for local mode, it's desirable that users are able to disable the spark UI. Couple places I found where the jetty will be started. In SparkEnv.scala 1) val broadcastManager = new BroadcastManager(isDriver, conf, securityManager) 2) val httpFileServer = new HttpFileServer(securityManager) httpFileServer.initialize() I don't know if broadcastManager is needed in local mode tho. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2160) error of Decision tree algorithm in Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033550#comment-14033550 ] Sean Owen commented on SPARK-2160: -- You already added this as https://issues.apache.org/jira/browse/SPARK-2152 right? error of Decision tree algorithm in Spark MLlib -- Key: SPARK-2160 URL: https://issues.apache.org/jira/browse/SPARK-2160 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: caoli Labels: patch Fix For: 1.1.0 Original Estimate: 4h Remaining Estimate: 4h The computation of rightNodeAgg in the decision tree algorithm in Spark MLlib is wrong: in the function extractLeftRightNodeAggregates(), the binData index used when computing rightNodeAgg is incorrect. In DecisionTree.scala, around line 980: rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = binData(shift + (2 * (numBins - 2 - splitIndex))) + rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) The index computed as binData(shift + (2 * (numBins - 2 - splitIndex))) is wrong, so the resulting rightNodeAgg contains repeated data from the bins. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038944#comment-14038944 ] Sean Owen commented on SPARK-2223: -- Is it not just because there are a lot of tests, and they generally can't be run in parallel? I agree, an hour for tests is really long, but I'm not sure if there is a problem except having lots of tests. You can run subsets of tests with Maven though to test targeted changes. Building and running tests with maven is extremely slow --- Key: SPARK-2223 URL: https://issues.apache.org/jira/browse/SPARK-2223 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Thomas Graves For some reason using Maven with Spark is extremely slow. Building and running tests takes way longer than other projects I have used that use Maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2224) allow running tests for one sub module
[ https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038956#comment-14038956 ] Sean Owen commented on SPARK-2224: -- mvn test -pl [module], no? allow running tests for one sub module -- Key: SPARK-2224 URL: https://issues.apache.org/jira/browse/SPARK-2224 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0 Reporter: Thomas Graves We should have a way to run just the unit tests in a submodule (like core or yarn, etc.). One way would be to support changing directories into the submodule and running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
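To make the command above concrete, a couple of hedged examples of scoping a Maven run to one module; "core" is just an illustrative module name.
{code}
# Run only the tests in the core module
mvn test -pl core

# Same, but also build the modules core depends on first
mvn test -pl core -am
{code}
Both -pl (projects list) and -am (also make) are standard Maven reactor options.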
[jira] [Commented] (SPARK-1568) Spark 0.9.0 hangs reading s3
[ https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039797#comment-14039797 ] Sean Owen commented on SPARK-1568: -- Sam, did the other recent changes to S3 deps resolve this, do you think? Spark 0.9.0 hangs reading s3 Key: SPARK-1568 URL: https://issues.apache.org/jira/browse/SPARK-1568 Project: Spark Issue Type: Bug Reporter: sam I've tried several jobs now and many of the tasks complete, then it gets stuck and just hangs. The exact same jobs function perfectly fine if I distcp to HDFS first and read from HDFS. Many thanks -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039802#comment-14039802 ] Sean Owen commented on SPARK-2223: -- On a latest-generation MacBook Pro here, a full 'mvn clean install' takes 91:50 without zinc. With zinc, it's 51:02. Building and running tests with maven is extremely slow --- Key: SPARK-2223 URL: https://issues.apache.org/jira/browse/SPARK-2223 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Thomas Graves For some reason using Maven with Spark is extremely slow. Building and running tests takes way longer than other projects I have used that use Maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1339) Build error: org.eclipse.paho:mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039804#comment-14039804 ] Sean Owen commented on SPARK-1339: -- (Just cruising some old issues) I can't reproduce this, and this is a general symptom of a repo not being accessible. It's actually nothing to do with mqtt-client per se. Also, we've fixed some repo issues along the way. Build error: org.eclipse.paho:mqtt-client - Key: SPARK-1339 URL: https://issues.apache.org/jira/browse/SPARK-1339 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Reporter: Ken Williams Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. The Maven error is: {code} [ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus {code} My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4. Is there an additional Maven repository I should add or something? If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, but I would really like to get the examples working because I haven't played with Spark before. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS
[ https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039805#comment-14039805 ] Sean Owen commented on SPARK-1138: -- This is no longer observed in the unit tests. The comments here say it was a Netty dependency problem, and I know that has since been cleaned up. Suggest this is resolved then? Spark 0.9.0 does not work with Hadoop / HDFS Key: SPARK-1138 URL: https://issues.apache.org/jira/browse/SPARK-1138 Project: Spark Issue Type: Bug Reporter: Sam Abeyratne UPDATE: This problem is certainly related to trying to use Spark 0.9.0 and the latest cloudera Hadoop / HDFS in the same jar. It seems no matter how I fiddle with the deps, the do not play nice together. I'm getting a java.util.concurrent.TimeoutException when trying to create a spark context with 0.9. I cannot, whatever I do, change the timeout. I've tried using System.setProperty, the SparkConf mechanism of creating a SparkContext and the -D flags when executing my jar. I seem to be able to run simple jobs from the spark-shell OK, but my more complicated jobs require external libraries so I need to build jars and execute them. Some code that causes this: println(Creating config) val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(MyApp) .setSparkHome(sparkHome) .set(spark.akka.askTimeout, parsed.getOrElse(timeouts, 100)) .set(spark.akka.timeout, parsed.getOrElse(timeouts, 100)) println(Creating sc) implicit val sc = new SparkContext(conf) The output: Creating config Creating sc log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. [ERROR] [02/26/2014 11:05:25.491] [main] [Remoting] Remoting error: [Startup timed out] [ akka.remote.RemoteTransportException: Startup timed out at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129) at akka.remote.Remoting.start(Remoting.scala:191) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) at org.apache.spark.SparkContext.init(SparkContext.scala:139) at com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40) at com.adbrain.accuracy.EvaluateAdtruthIDs.main(EvaluateAdtruthIDs.scala) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) ... 
11 more ] Exception in thread main java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) at org.apache.spark.SparkContext.init(SparkContext.scala:139) at com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40) at
[jira] [Commented] (SPARK-1675) Make clear whether computePrincipalComponents centers data
[ https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039944#comment-14039944 ] Sean Owen commented on SPARK-1675: -- Is this still valid? Looking at the code, PCA is computed as the SVD of the covariance matrix, so the means implicitly don't matter: they are not explicitly subtracted, and they do not affect the result. Or is there still a doc change desired? Make clear whether computePrincipalComponents centers data -- Key: SPARK-1675 URL: https://issues.apache.org/jira/browse/SPARK-1675 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
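For readers wondering why the means can be ignored, the point is that centering is already built into the covariance matrix itself; a short sketch in generic notation, not code or formulas taken from MLlib:
{code}
C = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top
  = \frac{1}{n-1} \Big( \sum_{i=1}^{n} x_i x_i^\top - n\,\bar{x}\bar{x}^\top \Big)
{code}
So an implementation that forms C from the Gramian and the column means, and then takes its SVD or eigendecomposition, has effectively centered the data without ever subtracting the mean from each row.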
[jira] [Commented] (SPARK-1846) RAT checks should exclude logs/ directory
[ https://issues.apache.org/jira/browse/SPARK-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039945#comment-14039945 ] Sean Owen commented on SPARK-1846: -- Just looking over some old JIRAs. This appears to be resolved already. logs is excluded. RAT checks should exclude logs/ directory - Key: SPARK-1846 URL: https://issues.apache.org/jira/browse/SPARK-1846 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Andrew Ash When there are logs in the logs/ directory, the rat check from ./dev/check-license fails. ``` aash@aash-mbp ~/git/spark$ find logs -type f logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ./dev/check-license Could not find Apache license headers in the following files: !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 !? 
/Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ``` -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1804) Mark 0.9.1 as released in JIRA
[ https://issues.apache.org/jira/browse/SPARK-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039947#comment-14039947 ] Sean Owen commented on SPARK-1804: -- Looks like this can be closed as resolved. https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:versions-panel Mark 0.9.1 as released in JIRA -- Key: SPARK-1804 URL: https://issues.apache.org/jira/browse/SPARK-1804 Project: Spark Issue Type: Task Components: Documentation, Project Infra Affects Versions: 0.9.1 Reporter: Stevo Slavic Priority: Trivial 0.9.1 has been released but is labeled as unreleased in SPARK JIRA project. Please have it marked as released. Also please document that step in release process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1803) Rename test resources to be compatible with Windows FS
[ https://issues.apache.org/jira/browse/SPARK-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039948#comment-14039948 ] Sean Owen commented on SPARK-1803: -- PR was committed so this is another that seems to be closeable. Rename test resources to be compatible with Windows FS -- Key: SPARK-1803 URL: https://issues.apache.org/jira/browse/SPARK-1803 Project: Spark Issue Type: Task Components: Windows Affects Versions: 0.9.1 Reporter: Stevo Slavic Priority: Trivial {{git clone}} of master branch and then {{git status}} on Windows reports untracked files: {noformat} # Untracked files: # (use git add file... to include in what will be committed) # # sql/hive/src/test/resources/golden/Column pruning # sql/hive/src/test/resources/golden/Partition pruning # sql/hive/src/test/resources/golden/Partiton pruning {noformat} Actual issue is that several files under {{sql/hive/src/test/resources/golden}} directory have colon in name which is invalid character in file name on Windows. Please have these files renamed to a Windows compatible file name. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1046) Enable to build behind a proxy.
[ https://issues.apache.org/jira/browse/SPARK-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039950#comment-14039950 ] Sean Owen commented on SPARK-1046: -- Is this stale / resolved? I don't see this in the code at this point. Enable to build behind a proxy. --- Key: SPARK-1046 URL: https://issues.apache.org/jira/browse/SPARK-1046 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.8.1 Reporter: Kousuke Saruta Priority: Minor I tried to build spark-0.8.1 behind a proxy and failed although I set http/https.proxyHost, proxyPort, proxyUser, proxyPassword. I found it's caused by accessing GitHub using the git protocol (git://). The URL is hard-coded in SparkPluginBuild.scala as follows. {code} lazy val junitXmlListener = uri("git://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce") {code} After I rewrote the URL as follows, I could build successfully. {code} lazy val junitXmlListener = uri("https://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce") {code} I think we should be able to build whether we are behind a proxy or not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-721) Fix remaining deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039951#comment-14039951 ] Sean Owen commented on SPARK-721: - This appears to be resolved as I don't think these warnings have been in the build for a while. Fix remaining deprecation warnings -- Key: SPARK-721 URL: https://issues.apache.org/jira/browse/SPARK-721 Project: Spark Issue Type: Improvement Affects Versions: 0.7.1 Reporter: Josh Rosen Assignee: Gary Struthers Priority: Minor Labels: Starter The recent patch to re-enable deprecation warnings fixed many of them, but there's still a few left; it would be nice to fix them. For example, here's one in RDDSuite: {code} [warn] /Users/joshrosen/Documents/spark/spark/core/src/test/scala/spark/RDDSuite.scala:32: method mapPartitionsWithSplit in class RDD is deprecated: use mapPartitionsWithIndex [warn] val partitionSumsWithSplit = nums.mapPartitionsWithSplit { [warn] ^ [warn] one warning found {code} Also, it looks like Scala 2.9 added a second deprecatedSince parameter to @Deprecated. We didn't fill this in, which causes some additional warnings: {code} [warn] /Users/joshrosen/Documents/spark/spark/core/src/main/scala/spark/RDD.scala:370: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(use mapPartitionsWithIndex) [warn]^ [warn] one warning found {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
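As a hedged sketch of the fix for the second kind of warning quoted above, Scala's @deprecated annotation takes a message and a "since" version; the names and version string below are invented for illustration, not taken from Spark's source.
{code}
object DeprecationExample {
  // Supplying both arguments avoids the "@deprecated now takes two arguments" warning.
  @deprecated("use newName instead", "0.7.1")
  def oldName(x: Int): Int = newName(x)

  def newName(x: Int): Int = x + 1
}
{code}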
[jira] [Commented] (SPARK-1996) Remove use of special Maven repo for Akka
[ https://issues.apache.org/jira/browse/SPARK-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039965#comment-14039965 ] Sean Owen commented on SPARK-1996: -- PR: https://github.com/apache/spark/pull/1170 Remove use of special Maven repo for Akka - Key: SPARK-1996 URL: https://issues.apache.org/jira/browse/SPARK-1996 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Matei Zaharia Fix For: 1.0.1 According to http://doc.akka.io/docs/akka/2.3.3/intro/getting-started.html Akka is now published to Maven Central, so our documentation and POM files don't need to use the old Akka repo. It will be one less step for users to worry about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1316) Remove use of Commons IO
[ https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039991#comment-14039991 ] Sean Owen commented on SPARK-1316: -- PR: https://github.com/apache/spark/pull/1173 Actually, Commons IO is not even a dependency right now. Remove use of Commons IO Key: SPARK-1316 URL: https://issues.apache.org/jira/browse/SPARK-1316 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Sean Owen Priority: Minor (This follows from a side point on SPARK-1133, in discussion of the PR: https://github.com/apache/spark/pull/164 ) Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark Utils.scala class. Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too. -- This message was sent by Atlassian JIRA (v6.2#6252)
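A hedged sketch of the kind of substitution the issue proposes, replacing a Commons IO call with its Guava equivalent; the file name is made up and the actual call sites in Spark may have looked different.
{code}
import java.io.File
import com.google.common.base.Charsets
import com.google.common.io.Files

// Commons IO version: FileUtils.readFileToString(new File("example.txt"), "UTF-8")
// Guava equivalent, so the Commons IO dependency can be dropped:
val contents: String = Files.toString(new File("example.txt"), Charsets.UTF_8)
{code}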
[jira] [Commented] (SPARK-2249) how to convert rows of schemaRdd into HashMaps
[ https://issues.apache.org/jira/browse/SPARK-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041766#comment-14041766 ] Sean Owen commented on SPARK-2249: -- This is where issues are reported, rather than where questions are asked. I think this should be closed. Use the user@ mailing list instead. how to convert rows of schemaRdd into HashMaps -- Key: SPARK-2249 URL: https://issues.apache.org/jira/browse/SPARK-2249 Project: Spark Issue Type: Question Reporter: jackielihf spark 1.0.0 how to convert rows of schemaRdd into HashMaps using column names as keys? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
[ https://issues.apache.org/jira/browse/SPARK-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042025#comment-14042025 ] Sean Owen commented on SPARK-2257: -- I don't think this is a bug, in the sense that it is just a different formulation of ALS. It's in the ALS-WR paper, but not the more well-known Hu/Koren/Volinsky paper. This is weighted regularization and it does help in some cases. In fact, it's already implemented in MLlib, although went in just after 1.0.0: https://github.com/apache/spark/commit/a6e0afdcf0174425e8a6ff20b2bc2e3a7a374f19#diff-2b593e0b4bd6eddab37f04968baa826c I think this is therefore already implemented. The algorithm of ALS in mlib lacks a parameter --- Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Labels: patch Fix For: 1.1.0 Original Estimate: 336h Remaining Estimate: 336h When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been decreased, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? -- This message was sent by Atlassian JIRA (v6.2#6252)
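For context on the weighted regularization being referenced, the ALS-WR paper's objective scales each user's and item's penalty by their rating counts; written here in generic notation, not as it appears in the MLlib source:
{code}
\min_{X,Y} \sum_{(u,i) \in R} \big( r_{ui} - x_u^\top y_i \big)^2
  + \lambda \Big( \sum_u n_u \lVert x_u \rVert^2 + \sum_i m_i \lVert y_i \rVert^2 \Big)
{code}
Here the first sum runs over the observed ratings R, n_u is the number of ratings by user u, and m_i the number of ratings of item i, which corresponds to the lambda * ratingsNum(index) scaling discussed above.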
[jira] [Commented] (SPARK-2251) MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition
[ https://issues.apache.org/jira/browse/SPARK-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042738#comment-14042738 ] Sean Owen commented on SPARK-2251: -- For what it's worth, I can reproduce this. In the sample, the test RDD has 2 partitions, containing 2 and 1 examples. The prediction RDD has 2 partitions, containing 1 and 2 examples respectively. So they aren't matched up, even though one is a 1-1 map() of the other. That seems like it shouldn't happen? maybe someone more knowledgeable can say whether that itself should occur. test is a PartitionwiseSampledRDD and prediction is a MappedRDD of course. If it is allowed to happen, then the example should be fixed, and I could easily supply a patch. It can be done without having to zip up RDDs to begin with. MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition -- Key: SPARK-2251 URL: https://issues.apache.org/jira/browse/SPARK-2251 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: OS: Fedora Linux Spark Version: 1.0.0. Git clone from the Spark Repository Reporter: Jun Xie Priority: Minor Labels: Naive-Bayes I follow the exact code from Naive Bayes Example (http://spark.apache.org/docs/latest/mllib-naive-bayes.html) of MLLib. When I executed the final command: val accuracy = 1.0 * predictionAndLabel.filter(x = x._1 == x._2).count() / test.count() It complains Can only zip RDDs with same number of elements in each partition. I got the following exception: 14/06/23 19:39:23 INFO SparkContext: Starting job: count at console:31 14/06/23 19:39:23 INFO DAGScheduler: Got job 3 (count at console:31) with 2 output partitions (allowLocal=false) 14/06/23 19:39:23 INFO DAGScheduler: Final stage: Stage 4(count at console:31) 14/06/23 19:39:23 INFO DAGScheduler: Parents of final stage: List() 14/06/23 19:39:23 INFO DAGScheduler: Missing parents: List() 14/06/23 19:39:23 INFO DAGScheduler: Submitting Stage 4 (FilteredRDD[14] at filter at console:31), which has no missing parents 14/06/23 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (FilteredRDD[14] at filter at console:31) 14/06/23 19:39:23 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:0 as 3410 bytes in 0 ms 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:1 as TID 9 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:1 as 3410 bytes in 1 ms 14/06/23 19:39:23 INFO Executor: Running task ID 8 14/06/23 19:39:23 INFO Executor: Running task ID 9 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 ERROR Executor: Exception in task ID 9 org.apache.spark.SparkException: Can only zip RDDs with same number of elements 
in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 14/06/23 19:39:23 ERROR Executor: Exception in task ID 8
[jira] [Commented] (SPARK-2268) Utils.createTempDir() creates race with HDFS at shutdown
[ https://issues.apache.org/jira/browse/SPARK-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043102#comment-14043102 ] Sean Owen commented on SPARK-2268: -- Yeah this hook is deleting local files, not HDFS files. I don't think it can interact with Hadoop APIs or else it fails when used without Hadoop. Utils.createTempDir() creates race with HDFS at shutdown Key: SPARK-2268 URL: https://issues.apache.org/jira/browse/SPARK-2268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Utils.createTempDir() has this code: {code} // Add a shutdown hook to delete the temp dir when the JVM exits Runtime.getRuntime.addShutdownHook(new Thread(delete Spark temp dir + dir) { override def run() { // Attempt to delete if some patch which is parent of this is not already registered. if (! hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir) } }) {code} This creates a race with the shutdown hooks registered by HDFS, since the order of execution is undefined; if the HDFS hooks run first, you'll get exceptions about the file system being closed. Instead, this should use Hadoop's ShutdownHookManager with a proper priority, so that it runs before the HDFS hooks. -- This message was sent by Atlassian JIRA (v6.2#6252)
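A minimal sketch of the ShutdownHookManager approach the report proposes, assuming hadoop-common is on the classpath; the priority value is an illustrative choice, and as the comment above notes, taking this route would make the hook depend on Hadoop being present.
{code}
import org.apache.hadoop.util.ShutdownHookManager

// Hooks with higher priority run earlier; FileSystem's own hook registers at
// priority 10 (FileSystem.SHUTDOWN_HOOK_PRIORITY), so anything larger runs
// before HDFS clients are closed.
val cleanupPriority = 50  // illustrative, any value > 10 would do
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // delete the local Spark temp dir here (Utils.deleteRecursively(dir) in the quoted code)
  }
}, cleanupPriority)
{code}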
[jira] [Commented] (SPARK-2251) MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition
[ https://issues.apache.org/jira/browse/SPARK-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043169#comment-14043169 ] Sean Owen commented on SPARK-2251: -- Well the change to the examples is pretty straightforward. Instead of separately computing predictions, you just: {code} val predictionAndLabel = test.map(x = (model.predict(x.features), x.label)) {code} ... and similarly for other languages, and other examples. In fact it seems more straightforward. But I am wondering if this is actually a bug in PartitionwiseSampledRDD. [~mengxr] is this a bit of code you wrote or are familiar with? MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition -- Key: SPARK-2251 URL: https://issues.apache.org/jira/browse/SPARK-2251 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: OS: Fedora Linux Spark Version: 1.0.0. Git clone from the Spark Repository Reporter: Jun Xie Priority: Minor Labels: Naive-Bayes I follow the exact code from Naive Bayes Example (http://spark.apache.org/docs/latest/mllib-naive-bayes.html) of MLLib. When I executed the final command: val accuracy = 1.0 * predictionAndLabel.filter(x = x._1 == x._2).count() / test.count() It complains Can only zip RDDs with same number of elements in each partition. I got the following exception: 14/06/23 19:39:23 INFO SparkContext: Starting job: count at console:31 14/06/23 19:39:23 INFO DAGScheduler: Got job 3 (count at console:31) with 2 output partitions (allowLocal=false) 14/06/23 19:39:23 INFO DAGScheduler: Final stage: Stage 4(count at console:31) 14/06/23 19:39:23 INFO DAGScheduler: Parents of final stage: List() 14/06/23 19:39:23 INFO DAGScheduler: Missing parents: List() 14/06/23 19:39:23 INFO DAGScheduler: Submitting Stage 4 (FilteredRDD[14] at filter at console:31), which has no missing parents 14/06/23 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (FilteredRDD[14] at filter at console:31) 14/06/23 19:39:23 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:0 as 3410 bytes in 0 ms 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:1 as TID 9 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:1 as 3410 bytes in 1 ms 14/06/23 19:39:23 INFO Executor: Running task ID 8 14/06/23 19:39:23 INFO Executor: Running task ID 9 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 ERROR Executor: Exception in task ID 9 org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at 
org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 14/06/23 19:39:23 ERROR Executor: Exception in task ID 8 org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at
[jira] [Comment Edited] (SPARK-2293) Replace RDD.zip usage by map with predict inside.
[ https://issues.apache.org/jira/browse/SPARK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045874#comment-14045874 ] Sean Owen edited comment on SPARK-2293 at 6/27/14 8:53 PM: --- I can make a PR for the example changes since I was already looking at this, unless you've already got it done. As for a new method -- kind of a toss-up between the small added convenience and adding another method to the API. For my part I found it clear to just write a map call. https://github.com/apache/spark/pull/1250 was (Author: srowen): I can make a PR for the example changes since I was already looking at this, unless you've already got it done. As for a new method -- kind of a toss-up between the small added convenience and adding another method to the API. For my part I found it clear to just write a map call. Replace RDD.zip usage by map with predict inside. - Key: SPARK-2293 URL: https://issues.apache.org/jira/browse/SPARK-2293 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Priority: Minor In our guide, we use {code} val prediction = model.predict(test.map(_.features)) val predictionAndLabel = prediction.zip(test.map(_.label)) {code} This is not efficient because test will be computed twice. We should change it to {code} val predictionAndLabel = test.map(p = (model.predict(p.features), p.label)) {code} It is also nice to add a `predictWith` method to predictive models. {code} def predictWith[V](RDD[(Vector, V)]): RDD[(Double, V)] {code} But I'm not sure whether this is a good name. `predictWithValue`? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1945) Add full Java examples in MLlib docs
[ https://issues.apache.org/jira/browse/SPARK-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047101#comment-14047101 ] Sean Owen commented on SPARK-1945: -- As an aside, there is also JavaPairRDD, which is a better specialization of JavaRDD<Tuple2<?, ?>>. But here you're trying to call Scala APIs so you can't make it expose JavaPairRDD. For that there would need to be a specialized Java version of this API. Add full Java examples in MLlib docs Key: SPARK-1945 URL: https://issues.apache.org/jira/browse/SPARK-1945 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Matei Zaharia Labels: Starter Fix For: 1.0.0 Right now some of the Java tabs only say the following: All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object. Would be nice to translate the Scala code into Java instead. Also, a few pages (most notably the Matrix one) don't have Java examples at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2330) Spark shell has weird scala semantics
[ https://issues.apache.org/jira/browse/SPARK-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047689#comment-14047689 ] Sean Owen commented on SPARK-2330: -- I can't reproduce this in HEAD right now. Try that? This also sounds like a potential duplicate of https://issues.apache.org/jira/browse/SPARK-1199 Spark shell has weird scala semantics - Key: SPARK-2330 URL: https://issues.apache.org/jira/browse/SPARK-2330 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Environment: Ubuntu 14.04 with spark-x.x.x-bin-hadoop2 Reporter: Andrea Ferretti Labels: scala, shell Normal scala expressions are interpreted in a strange way in the spark shell. For instance {noformat} case class Foo(x: Int) def print(f: Foo) = f.x val f = Foo(3) print(f) console:24: error: type mismatch; found : Foo required: Foo {noformat} For another example {noformat} trait Currency case object EUR extends Currency case object USD extends Currency def nextCurrency: Currency = nextInt(2) match { case 0 => EUR case _ => USD } console:22: error: type mismatch; found : EUR.type required: Currency case 0 => EUR console:24: error: type mismatch; found : USD.type required: Currency case _ => USD {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14048041#comment-14048041 ] Sean Owen edited comment on SPARK-2331 at 6/30/14 7:48 PM: --- My 2 cents. {code} val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.RDD[Nothing] Note: String : Nothing, but class RDD is invariant in type T. You may wish to define T as -T instead. (SLS 4.5) val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} but even {code} val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD[String]){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.EmptyRDD[String] val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD[String]){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} So I think this is really about RDDs being invariant, rather than the return type here, and that seems to be how it's going to be: https://issues.apache.org/jira/browse/SPARK-1296 I think there's an argument for hiding EmptyRDD although that would be a little API change at this point. was (Author: srowen): My 2 cents. You mean the type is EmptyRDD[Nothing] right? {code} val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.RDD[Nothing] Note: String : Nothing, but class RDD is invariant in type T. You may wish to define T as -T instead. (SLS 4.5) val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} but even {code} val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD[String]){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.EmptyRDD[String] val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD[String]){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} So I think this is really about RDDs being invariant, rather than the return type here, and that seems to be how it's going to be: https://issues.apache.org/jira/browse/SPARK-1296 I think there's an argument for hiding EmptyRDD although that would be a little API change at this point. SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14048041#comment-14048041 ] Sean Owen commented on SPARK-2331: -- My 2 cents. You mean the type is EmptyRDD[Nothing] right? {code} val rdds = Seq("a", "b", "c").foldLeft(sc.emptyRDD){ (rdd, path) => rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.RDD[Nothing] Note: String >: Nothing, but class RDD is invariant in type T. You may wish to define T as -T instead. (SLS 4.5) val rdds = Seq("a", "b", "c").foldLeft(sc.emptyRDD){ (rdd, path) => rdd.union(sc.textFile(path)) } {code} but even {code} val rdds = Seq("a", "b", "c").foldLeft(sc.emptyRDD[String]){ (rdd, path) => rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.EmptyRDD[String] val rdds = Seq("a", "b", "c").foldLeft(sc.emptyRDD[String]){ (rdd, path) => rdd.union(sc.textFile(path)) } {code} So I think this is really about RDDs being invariant, rather than the return type here, and that seems to be how it's going to be: https://issues.apache.org/jira/browse/SPARK-1296 I think there's an argument for hiding EmptyRDD although that would be a little API change at this point. SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.2#6252)