Re: ALS.trainImplicit running out of mem when using higher rank
If you are running Spark in local mode, executor parameters are not used because there is no executor. You should try setting the corresponding driver parameter instead. On Mon, Jan 19, 2015, 00:21 Sean Owen so...@cloudera.com wrote: OK. Are you sure the executor has the memory you think? -Xmx24g in its command line? It may be that for some reason your job is reserving an exceptionally large amount of non-heap memory. I am not sure that's to be expected with the ALS job though. Even if the settings work, consider using the explicit command line configuration. On Sat, Jan 17, 2015 at 12:49 PM, Antony Mayi antonym...@yahoo.com wrote: the values are for sure applied as expected - confirmed using the Spark UI environment page... it comes from my defaults configured using 'spark.yarn.executor.memoryOverhead=8192' (yes, now increased even more) in /etc/spark/conf/spark-defaults.conf and 'export SPARK_EXECUTOR_MEMORY=24G' in /etc/spark/conf/spark-env.sh
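For local mode, a minimal illustration of passing the driver-side equivalent on the command line (the class and jar names here are hypothetical):

    spark-submit --master local[*] --driver-memory 24g --class com.example.ALSJob als-job.jar

Since the driver JVM is already running by the time a SparkConf is constructed, heap settings like this generally have to be given on the command line or in spark-defaults.conf rather than set programmatically.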
Re: running a job on ec2 Initial job has not accepted any resources
Just make sure both versions of Spark are the same (the one from where you are submitting the job, and the one to which you are submitting it). Another possible cause is firewall issues, if you are submitting the job from another network/remote machine. Thanks Best Regards On Sun, Jan 18, 2015 at 6:02 PM, Grzegorz Dubicki grzegorz.dubi...@gmail.com wrote: Hi mehrdad, I seem to have the same issue as you wrote about here. Did you manage to resolve it?
Re: Join DStream With Other Datasets
Hi Sean, Thanks for your advice; a normal 'val' will suffice. But will it be serialized and transferred with every batch and every partition? That's why broadcast exists, right? For now I'm going to use 'val', but I'm still looking for a broadcast-based solution. On Sun, Jan 18, 2015 at 5:36 PM, Sean Owen so...@cloudera.com wrote: I think that this problem is not Spark-specific, since you are simply side-loading some data into memory. Therefore you do not need an answer that uses Spark. Simply load the data and then poll for an update each time it is accessed? Or at some reasonable interval? This is just something you write in Java/Scala. On Jan 17, 2015 2:06 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I want to join a DStream with some other dataset, e.g. join a click stream with a spam IP list. I can think of two possible solutions: one is to use a broadcast variable, and the other is to use the transform operation as described in the manual. But the problem is the spam IP list will be updated outside of the Spark Streaming program, so how can the program be notified to reload the list? Broadcast variables are immutable. For the transform operation, is it costly to reload the RDD on every batch? If it is, and I use RDD.persist(), does it mean I need to launch a thread to regularly unpersist it so that it can get the updates? Any ideas will be appreciated. Thanks. -- Jerry
Re: maven doesn't build dependencies with Scala 2.11
bq. there was no 2.11 Kafka available That's right. Adding the external/kafka module resulted in: [ERROR] Failed to execute goal on project spark-streaming-kafka_2.11: Could not resolve dependencies for project org.apache.spark:spark-streaming-kafka_2.11:jar:1.3.0-SNAPSHOT: Could not find artifact org.apache.kafka:kafka_2.11:jar:0.8.0 in central ( https://repo1.maven.org/maven2) - [Help 1] Cheers On Sun, Jan 18, 2015 at 10:41 AM, Sean Owen so...@cloudera.com wrote: I could be wrong, but I thought this was on purpose. At the time it was set up, there was no 2.11 Kafka available? Or one of its dependencies wouldn't work with 2.11? But I'm not sure what the OP means by "maven doesn't build Spark's dependencies", because Ted indicates it does, and of course you can see that these artifacts are published. On Sun, Jan 18, 2015 at 2:46 AM, Ted Yu yuzhih...@gmail.com wrote: There are 3 jars under the lib_managed/jars directory both with and without the -Dscala-2.11 flag. The difference between the scala-2.10 and scala-2.11 profiles is that the scala-2.10 profile has the following: <modules> <module>external/kafka</module> </modules> FYI On Sat, Jan 17, 2015 at 4:07 PM, Ted Yu yuzhih...@gmail.com wrote: I did the following: dev/change-version-to-2.11.sh mvn -DHADOOP_PROFILE=hadoop-2.4 -Pyarn,hive -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package And the mvn command passed. Did you see any cross-compilation errors? Cheers BTW the two links you mentioned are consistent in terms of building for Scala 2.11 On Sat, Jan 17, 2015 at 3:43 PM, Walrus theCat walrusthe...@gmail.com wrote: Hi, When I run this: dev/change-version-to-2.11.sh mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package as per here, maven doesn't build Spark's dependencies. Only when I run: dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package as gathered from here, do I get Spark's dependencies built without any cross-compilation errors. Questions: - How can I make maven do this? - How can I specify the use of Scala 2.11 in my own .pom files? Thanks
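On the second question, a hedged sketch: the published artifacts carry the Scala version as an artifactId suffix, so a 2.11 dependency in your own pom would look something like the following (the version shown is only an example):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>1.2.0</version>
    </dependency>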
RE: Streaming with Java: Expected ReduceByWindow to Return JavaDStream
Hi Jeff, From my understanding this seems more like a bug: since JavaDStreamLike is meant for Java code, returning a Scala DStream is not reasonable. You can fix this by submitting a PR, or I can help you fix it. Thanks Jerry From: Jeff Nadler [mailto:jnad...@srcginc.com] Sent: Monday, January 19, 2015 2:04 PM To: user@spark.apache.org Subject: Streaming with Java: Expected ReduceByWindow to Return JavaDStream Can anyone tell me if my expectations are sane? I'm trying to do a reduceByWindow using the 3-arg signature (not providing an inverse reduce function): JavaDStream<whatevs> reducedStream = messages.reduceByWindow((x, y) -> reduce(x, y), Durations.seconds(5), Durations.seconds(5)); This isn't building; it looks like it's returning DStream, not JavaDStream. From JavaDStreamLike.scala, this signature returns DStream, while the 4-arg signature with the inverse reduce returns JavaDStream. def reduceByWindow( reduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration ): DStream[T] = { dstream.reduceByWindow(reduceFunc, windowDuration, slideDuration) } So I'm just a noob. Is this a bug or am I missing something? Thanks! Jeff Nadler
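For illustration, a rough sketch (not the actual patch) of what a Java-friendly overload in JavaDStreamLike could look like, assuming the usual JFunction2 alias for org.apache.spark.api.java.function.Function2:

    // sketch only: accept a Java Function2 and wrap the Scala result back into a JavaDStream
    def reduceByWindow(
        reduceFunc: JFunction2[T, T, T],
        windowDuration: Duration,
        slideDuration: Duration): JavaDStream[T] = {
      // the implicit ClassTag already in scope in JavaDStreamLike makes this wrap compile
      new JavaDStream(
        dstream.reduceByWindow((x, y) => reduceFunc.call(x, y), windowDuration, slideDuration))
    }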
RE: SparkSQL 1.2.0 sources API error
It seems a netty jar with an incompatible method signature is being picked up. Can you check if there are different versions of the netty jar in your classpath? From: Walrus theCat [mailto:walrusthe...@gmail.com] Sent: Sunday, January 18, 2015 3:37 PM To: user@spark.apache.org Subject: Re: SparkSQL 1.2.0 sources API error I'm getting this also, with Scala 2.11 and Scala 2.10: 15/01/18 07:34:51 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/01/18 07:34:51 INFO Remoting: Starting remoting 15/01/18 07:34:51 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-7] shutting down ActorSystem [sparkDriver] java.lang.NoSuchMethodError: org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V at akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:283) at akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:240) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78) at scala.util.Try$.apply(Try.scala:161) at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73) at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84) at akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84) at scala.util.Success.flatMap(Try.scala:200) at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84) at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:692) at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:684) at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721) at akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:684) at akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:492) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down. Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:180) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632) at akka.actor.ActorSystem$.apply(ActorSystem.scala:141) at akka.actor.ActorSystem$.apply(ActorSystem.scala:118) at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
Re: SparkSQL 1.2.0 sources API error
NioWorkerPool(Executor workerExecutor, int workerCount) was added in netty 3.5.4: https://github.com/netty/netty/blob/netty-3.5.4.Final/src/main/java/org/jboss/netty/channel/socket/nio/NioWorkerPool.java If there is a netty jar in the classpath older than the above release, you would see the exception. Cheers On Sun, Jan 18, 2015 at 4:49 PM, Cheng, Hao hao.ch...@intel.com wrote: It seems a netty jar with an incompatible method signature is being picked up. Can you check if there are different versions of the netty jar in your classpath? *From:* Walrus theCat [mailto:walrusthe...@gmail.com] *Sent:* Sunday, January 18, 2015 3:37 PM *To:* user@spark.apache.org *Subject:* Re: SparkSQL 1.2.0 sources API error I'm getting this also, with Scala 2.11 and Scala 2.10: java.lang.NoSuchMethodError: org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V ... [stack trace snipped; see the previous message]
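One quick way to look for conflicting netty versions in a Maven build (the includes pattern is just an example):

    mvn dependency:tree -Dincludes=org.jboss.netty,io.netty

Any transitive dependency pulling in a netty 3.x release older than 3.5.4 would explain the missing constructor.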
Re: Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.
This looks like a bug in the master branch of Spark, related to some recent changes to EventLoggingListener. You can reproduce this bug on a fresh Spark checkout by running ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/nonexistent-dir, where /tmp/nonexistent-dir is a directory that doesn't exist and /tmp exists. It looks like older versions of EventLoggingListener would create the directory if it didn't exist. I think the issue here is that the error-checking code is overzealous and catches some non-error conditions, too; I've filed https://issues.apache.org/jira/browse/SPARK-5311 to investigate this. On Sun, Jan 18, 2015 at 1:59 PM, Ganon Pierce ganon.pie...@me.com wrote: I posted about the Application WebUI error (specifically the application WebUI, not the master WebUI generally) and have spent at least a few hours a day for over a week trying to resolve it, so I'd be very grateful for any suggestions. It is quite troubling that I appear to be the only one encountering this issue, and I've tried to include everything here which might be relevant (sorry for the length). Please see the thread "Current Build Gives HTTP ERROR" https://www.mail-archive.com/user@spark.apache.org/msg18752.html for specifics about the application WebUI issue and the master log. Environment: I'm doing my Spark builds and application programming in Scala locally on my MacBook Pro in Eclipse, using modified ec2 launch scripts to launch my cluster, uploading my Spark builds and models to s3, and uploading applications to and submitting them from ec2. I'm using Java 8 locally and also installing and using Java 8 on my ec2 instances (which works with Spark 1.2.0). I have a Windows machine at home (the MacBook is my work machine), but have not yet attempted to launch from there. Errors: I've built two different recent git versions of Spark, both multiple times, and when running applications both have produced an Application WebUI error and this exception: Exception in thread "main" java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. While both will display the master WebUI just fine, including running/completed applications, registered workers etc., when I try to access a running or completed application's WebUI by clicking its respective link, I receive a server error. When I manually create the above log directory, the exception goes away, but the WebUI problem does not. I don't have any strong evidence, but I suspect these errors and whatever is causing them are related. Why and How of Modifications to Launch Scripts for Installation of Unreleased Spark Versions: When using a prebuilt version of Spark on my cluster, everything works except the new methods I need, which I had previously added to my custom version of Spark and used by building the spark-assembly.jar locally and then replacing the assembly file produced through the 1.1.0 ec2 launch scripts. However, since my pull request was accepted and can now be found in the apache/spark repository along with some additional features I'd like to use, and because I'd like a more elegant permanent solution for launching a cluster and installing unreleased versions of Spark on my ec2 clusters, I've modified the included ec2 launch scripts in this way (credit to gen tang here: https://www.mail-archive.com/user@spark.apache.org/msg18761.html): 1. Clone the most recent git version of Spark 2. Use the make-dist script 3.
Tar the dist folder and upload the resulting spark-1.3.0-snapshot-hadoop1.tgz to s3 and change the file permissions 4. Fork the mesos/spark-ec2 repository and modify the spark/init.sh script to do a wget of my hosted distribution instead of Spark's stable release 5. Modify my spark_ec2.py script to point to my repository. 6. Modify my spark_ec2.py script to install Java 8 on my ec2 instances. (This works and does not produce the above-stated errors when using a stable release like 1.2.0.) Additional Possibly Related Info: As far as I can tell (I went through line by line), when I launch my recent build vs. when I launch the most recent stable release, the console prints almost identical INFO and WARNING messages, except where you would expect things to be different, e.g. version numbers. I've noted that after launch the prebuilt stable version does not have a /tmp/spark-events directory, but it is created when the application is launched, while it is never created in my build. Further, in my unreleased builds the application logs that I find are always stored as .inprogress files (when I set the logging directory to /root/ or add the /tmp/spark-events directory manually) even after completion, whereas I believe the suffix is supposed to change to .completed (or something similar) when the application finishes. Thanks for any help!
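Until the JIRA above is resolved, a minimal workaround based on Josh's reproduction is simply to create the directory before launching (the path is as in the example above):

    mkdir -p /tmp/spark-events
    ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/spark-events

Per Ganon's report, this clears the IllegalArgumentException, though apparently not the application WebUI error itself.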
Re: No Output
The error in the log file says: *java.lang.OutOfMemoryError: GC overhead limit exceeded* with a certain task ID, and the error repeats for further task IDs. What could be the problem? On Sun, Jan 18, 2015 at 2:45 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Does updating the Spark version mean setting up the entire cluster once more, or can we update it in some other way? On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste the code? Also you can try updating your Spark version. Thanks Best Regards On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0 on a single-node cluster. When I run a job with a small data set it runs perfectly, but when I use a data set of 350 KB, no output is produced, and when I try to run it a second time it gives me an exception saying that the SparkContext was shut down. Can anyone help me with this? Thank You
Re: running a job on ec2 Initial job has not accepted any resources
Hi mehrdad, I seem to have the same issue as you wrote about here. Did you manage to resolve it?
Re: Avoid broadcasting huge variables
The singleton hack works very differently in Spark 1.2.0 (it does not work if the program has multiple map-reduce jobs). I guess there should be official documentation on how to have each machine/node do an init step locally before executing any other instructions (e.g. loading a very big object locally, once, at the beginning, so that it can be used in all further map jobs assigned to that worker).
RE: Directory / File Reading Patterns
Also, I used the following pattern to extract information from a file path and add it to the output of a transformation: https://gist.github.com/btiernay/1ad5e3dea08904fe07d9 You may find it useful as well. Cheers, Bob From: btier...@hotmail.com To: so...@cloudera.com; snu...@hortonworks.com CC: user@spark.apache.org Subject: RE: Directory / File Reading Patterns Date: Sun, 18 Jan 2015 15:41:53 + You may also want to keep an eye on SPARK-5182 / SPARK-5302, which may help if you are using Spark SQL. It should be noted that this is possible with HiveContext today. Cheers, Bob Date: Sun, 18 Jan 2015 08:47:06 + Subject: Re: Directory / File Reading Patterns From: so...@cloudera.com To: snu...@hortonworks.com CC: user@spark.apache.org I think that putting part of the data (only) in a filename is an anti-pattern, but we sometimes have to play these where they lie. You can list all the directory paths containing the CSV files, map them each to RDDs with textFile, transform the RDDs to include info from the path, and then simply union them. This should even be fine performance-wise. On Jan 17, 2015 11:48 PM, Steve Nunez snu...@hortonworks.com wrote: Hello Users, I've got a real-world use case that seems common enough that its pattern would be documented somewhere, but I can't find any references to a simple solution. The challenge is that data is getting dumped into a directory structure, and that directory structure itself contains features that I need in my model. For example: bank_code/trader/Day-1.csv, bank_code/trader/Day-2.csv, ... Each CSV file contains a list of all the trades made by that individual each day. The problem is that the bank and trader should be part of the feature set. I.e., we need the RDD to look like: (bank, trader, day, list-of-trades) Anyone got any elegant solutions for doing this? Cheers, - SteveN
Re: Reducer memory exceeded
I think the problem is that you have a single object that is larger than 2GB and so fails to serialize to a byte array. I think it is best not to design it this way, as you can't parallelize combining the maps. You could go all the way to emitting key-value pairs and using reduceByKey. There are solutions between these two as well. On Jan 18, 2015 1:47 PM, octavian.ganea octavian.ga...@inf.ethz.ch wrote: Hi, Please help me with this problem. I would really appreciate your help! I am using Spark 1.2.0. I have a map-reduce job written in Spark in the following way: val sumW = splittedTrainingDataRDD.map(localTrainingData => LocalSGD(w, localTrainingData, numeratorCtEta, numitorCtEta, regularizer, 0.2)).reduce((w1, w2) => {w1.add(w2); w2.clear; w1}) Here, w is a Trove TLongDoubleHashMap containing no more than 50 million elements (in RAM this is ~15 GB). w1.add(w2) adds the values of the same key, for each key of both maps. My initial configuration is: conf.set("spark.cores.max", "16") conf.set("spark.akka.frameSize", "10") conf.set("spark.executor.memory", "120g") conf.set("spark.reducer.maxMbInFlight", "10") conf.set("spark.storage.memoryFraction", "0.9") conf.set("spark.shuffle.file.buffer.kb", "1000") conf.set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory") conf.set("spark.driver.maxResultSize", "120g") val sc = new SparkContext(conf) I am running this on a cluster with 8 machines; each machine has 16 cores and 130 GB RAM. My spark-env.sh contains: ulimit -n 20 SPARK_JAVA_OPTS="-Xms120G -Xmx120G -XX:-UseGCOverheadLimit -XX:-UseCompressedOops" SPARK_DRIVER_MEMORY=120G The error I get is at the reducer above (the reducer is in a file called Learning.scala, line 313): 15/01/18 14:35:52 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:836) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 15/01/18 14:35:52 INFO DAGScheduler: Job 2 failed: reduce at Learning.scala:313, took 54.657239 s Exception in thread "main" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702) at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701) at
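For reference, a sketch of the emit-pairs alternative Sean describes; toPairs is an assumed helper (not part of the original code) that iterates a TLongDoubleHashMap as (Long, Double) tuples:

    val sumW = splittedTrainingDataRDD
      .flatMap(local => toPairs(LocalSGD(w, local, numeratorCtEta, numitorCtEta, regularizer, 0.2)))
      .reduceByKey(_ + _)   // many small (key, sum) records are shuffled instead of one huge map

This keeps every serialized object small, avoiding the 2GB byte-array limit hit in the trace above.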
RE: Directory / File Reading Patterns
You may also want to keep an eye on SPARK-5182 / SPARK-5302, which may help if you are using Spark SQL. It should be noted that this is possible with HiveContext today. Cheers, Bob Date: Sun, 18 Jan 2015 08:47:06 + Subject: Re: Directory / File Reading Patterns From: so...@cloudera.com To: snu...@hortonworks.com CC: user@spark.apache.org I think that putting part of the data (only) in a filename is an anti-pattern, but we sometimes have to play these where they lie. You can list all the directory paths containing the CSV files, map them each to RDDs with textFile, transform the RDDs to include info from the path, and then simply union them. This should even be fine performance-wise. On Jan 17, 2015 11:48 PM, Steve Nunez snu...@hortonworks.com wrote: Hello Users, I've got a real-world use case that seems common enough that its pattern would be documented somewhere, but I can't find any references to a simple solution. The challenge is that data is getting dumped into a directory structure, and that directory structure itself contains features that I need in my model. For example: bank_code/trader/Day-1.csv, bank_code/trader/Day-2.csv, ... Each CSV file contains a list of all the trades made by that individual each day. The problem is that the bank and trader should be part of the feature set. I.e., we need the RDD to look like: (bank, trader, day, list-of-trades) Anyone got any elegant solutions for doing this? Cheers, - SteveN
Reducer memory exceeded
Hi, Please help me with this problem. I would really appreciate your help! I am using Spark 1.2.0. I have a map-reduce job written in Spark in the following way: val sumW = splittedTrainingDataRDD.map(localTrainingData => LocalSGD(w, localTrainingData, numeratorCtEta, numitorCtEta, regularizer, 0.2)).reduce((w1, w2) => {w1.add(w2); w2.clear; w1}) Here, w is a Trove TLongDoubleHashMap containing no more than 50 million elements (in RAM this is ~15 GB). w1.add(w2) adds the values of the same key, for each key of both maps. My initial configuration is: conf.set("spark.cores.max", "16") conf.set("spark.akka.frameSize", "10") conf.set("spark.executor.memory", "120g") conf.set("spark.reducer.maxMbInFlight", "10") conf.set("spark.storage.memoryFraction", "0.9") conf.set("spark.shuffle.file.buffer.kb", "1000") conf.set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory") conf.set("spark.driver.maxResultSize", "120g") val sc = new SparkContext(conf) I am running this on a cluster with 8 machines; each machine has 16 cores and 130 GB RAM. My spark-env.sh contains: ulimit -n 20 SPARK_JAVA_OPTS="-Xms120G -Xmx120G -XX:-UseGCOverheadLimit -XX:-UseCompressedOops" SPARK_DRIVER_MEMORY=120G The error I get is at the reducer above (the reducer is in a file called Learning.scala, line 313): 15/01/18 14:35:52 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem [sparkDriver] java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:836) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 15/01/18 14:35:52
INFO DAGScheduler: Job 2 failed: reduce at Learning.scala:313, took 54.657239 s Exception in thread "main" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702) at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428) at akka.actor.Actor$class.aroundPostStop(Actor.scala:475) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375) at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210) at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172) at akka.actor.ActorCell.terminate(ActorCell.scala:369) at
ExceptionInInitializerError when using a class defined in REPL
Hi experts, I'm getting ExceptionInInitializerError when using a class defined in the REPL. The code is something like this: case class TEST(a: String); sc.textFile("~~~").map(TEST(_)).count The code above used to work well until yesterday, but suddenly, for some reason, it fails with the error. I confirmed it still works in local mode. I've been getting a headache working on this problem the whole weekend. Any ideas? Environment: AWS EC2, S3; Spark v1.1.1, Hadoop 2.2 Attaching error logs: === 15/01/18 13:54:22 INFO TaskSetManager: Lost task 0.19 in stage 0.0 (TID 19) on executor ip-172-16-186-181.ap-northeast-1.compute.internal: java.lang.ExceptionInInitializerError (null) [duplicate 5] 15/01/18 13:54:22 ERROR TaskSetManager: Task 0 in stage 0.0 failed 20 times; aborting job 15/01/18 13:54:22 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 15/01/18 13:54:22 INFO TaskSchedulerImpl: Cancelling stage 0 15/01/18 13:54:22 INFO DAGScheduler: Failed to run first at <console>:45 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 20 times, most recent failure: Lost task 0.19 in stage 0.0 (TID 19, ip-172-16-186-181.ap-northeast-1.compute.internal): java.lang.ExceptionInInitializerError: $iwC.<init>(<console>:6) <init>(<console>:35) .<init>(<console>:39) .<clinit>(<console>) $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DailyStatsChartBuilder$.fromCsv(<console>:41) $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:43) $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:43) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$10.next(Iterator.scala:312) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1079) org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1079) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Trying to find where Spark persists RDDs when run with YARN
Hi, I am trying to find where Spark persists RDDs when we call the persist() API and run under YARN. This is purely for understanding... In my driver program, I wait indefinitely, so as to avoid any clean-up problems. In the actual job, I roughly do the following: JavaRDD<String> lines = context.textFile(args[0]); lines.persist(StorageLevel.DISK_ONLY()); lines.collect(); When run with a local executor, I can see that the files (like rdd_1_0) are persisted under directories like /var/folders/mt/51srrjc15wl3n829qkgnh2dmgp/T/spark-local-20150118201458-6147/15. Where, similarly, can I find these under YARN? Thanks hemanth
Re: No Output
Does updating the Spark version mean setting up the entire cluster once more, or can we update it in some other way? On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste the code? Also you can try updating your Spark version. Thanks Best Regards On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0 on a single-node cluster. When I run a job with a small data set it runs perfectly, but when I use a data set of 350 KB, no output is produced, and when I try to run it a second time it gives me an exception saying that the SparkContext was shut down. Can anyone help me with this? Thank You
How to create distributed matrices from Hive tables
Hi, We have large datasets already in the right format for a Spark MLlib matrix, but they are pre-computed by Hive and stored inside Hive. My question is: can we create a distributed matrix such as IndexedRowMatrix directly from Hive tables, avoiding reading the data out of the Hive tables and feeding it into an empty matrix? Regards
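A hedged sketch of one way this could work, assuming a Hive table with a row-index column and an array<double> column (the table and column names here are made up):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    // each row becomes an IndexedRow; no intermediate empty matrix is materialized
    val rows = hc.sql("SELECT row_id, vals FROM matrix_table").map { r =>
      IndexedRow(r.getLong(0), Vectors.dense(r(1).asInstanceOf[Seq[Double]].toArray))
    }
    val mat = new IndexedRowMatrix(rows)

Since the result of hc.sql is itself an RDD of Rows, the distributed matrix is built straight from the query output.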
Re: How to share a NonSerializable variable among tasks in the same worker node?
The singleton hack works very differently in Spark 1.2.0 (it does not work if the program has multiple map-reduce jobs). I guess there should be official documentation on how to have each machine/node do an init step locally before executing any other instructions (e.g. loading a very big object locally, once, at the beginning, so that it can be used in all further map jobs assigned to that worker).
Re: Avoid broadcasting huge variables
Why do you say it does not work? The singleton pattern works the same as ever. It is not a pattern that involves Spark. On Jan 18, 2015 12:57 PM, octavian.ganea octavian.ga...@inf.ethz.ch wrote: The singleton hack works very differently in Spark 1.2.0 (it does not work if the program has multiple map-reduce jobs). I guess there should be official documentation on how to have each machine/node do an init step locally before executing any other instructions (e.g. loading a very big object locally, once, at the beginning, so that it can be used in all further map jobs assigned to that worker).
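For concreteness, the pattern under discussion is usually written as a lazy val on an object, which is initialized at most once per executor JVM; the file path and names here are made up:

    object SpamList {
      // loaded once per JVM, on first access from a task; the file must exist
      // on each worker's local disk
      lazy val ips: Set[String] =
        scala.io.Source.fromFile("/local/path/spam.txt").getLines().toSet
    }

    rdd.filter(line => !SpamList.ips(line))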
Re: Join DStream With Other Datasets
I think that this problem is not Spark-specific, since you are simply side-loading some data into memory. Therefore you do not need an answer that uses Spark. Simply load the data and then poll for an update each time it is accessed? Or at some reasonable interval? This is just something you write in Java/Scala. On Jan 17, 2015 2:06 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I want to join a DStream with some other dataset, e.g. join a click stream with a spam IP list. I can think of two possible solutions: one is to use a broadcast variable, and the other is to use the transform operation as described in the manual. But the problem is the spam IP list will be updated outside of the Spark Streaming program, so how can the program be notified to reload the list? Broadcast variables are immutable. For the transform operation, is it costly to reload the RDD on every batch? If it is, and I use RDD.persist(), does it mean I need to launch a thread to regularly unpersist it so that it can get the updates? Any ideas will be appreciated. Thanks. -- Jerry
Re: Join DStream With Other Datasets
Hi, After some experiments, there are three methods that work for this 'join a DStream with another dataset that is updated periodically' problem. 1. Create an RDD in the transform operation:

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val filtered = words.transform { rdd =>
      val spam = ssc.sparkContext.textFile("spam.txt").collect.toSet
      rdd.filter(!spam(_))
    }

The caveat is that 'spam.txt' will be read in every batch. 2. Use a var for the broadcast variable...

    var bc = ssc.sparkContext.broadcast(getSpam)
    val filtered = words.filter(!bc.value(_))
    val pool = Executors.newSingleThreadScheduledExecutor
    pool.scheduleAtFixedRate(new Runnable {
      def run(): Unit = {
        val obc = bc
        bc = ssc.sparkContext.broadcast(getSpam)
        obc.unpersist
      }
    }, 0, 5, TimeUnit.SECONDS)

I'm surprised to have come up with this solution; I don't like the var, and the unpersist thing looks evil. 3. Use an accumulator:

    val spam = ssc.sparkContext.accumulableCollection(getSpam.to[mutable.HashSet])
    val filtered = words.filter(!spam.value(_))
    def run(): Unit = {
      spam.setValue(getSpam.to[mutable.HashSet])
    }

Now it looks less ugly... Anyway, I still hope there's a better solution. On Sun, Jan 18, 2015 at 2:12 AM, Jörn Franke jornfra...@gmail.com wrote: Can't you send a special event through Spark Streaming once the list is updated? So you have your normal events and a special reload event. On 17 Jan 2015 15:06, Ji ZHANG zhangj...@gmail.com wrote: Hi, I want to join a DStream with some other dataset, e.g. join a click stream with a spam IP list. I can think of two possible solutions: one is to use a broadcast variable, and the other is to use the transform operation as described in the manual. But the problem is the spam IP list will be updated outside of the Spark Streaming program, so how can the program be notified to reload the list? Broadcast variables are immutable. For the transform operation, is it costly to reload the RDD on every batch? If it is, and I use RDD.persist(), does it mean I need to launch a thread to regularly unpersist it so that it can get the updates? Any ideas will be appreciated. Thanks. -- Jerry
Re: Directory / File Reading Patterns
I think that putting part of the data (only) in a filename is an anti-pattern, but we sometimes have to play these where they lie. You can list all the directory paths containing the CSV files, map them each to RDDs with textFile, transform the RDDs to include info from the path, and then simply union them. This should even be fine performance-wise. On Jan 17, 2015 11:48 PM, Steve Nunez snu...@hortonworks.com wrote: Hello Users, I've got a real-world use case that seems common enough that its pattern would be documented somewhere, but I can't find any references to a simple solution. The challenge is that data is getting dumped into a directory structure, and that directory structure itself contains features that I need in my model. For example: bank_code/trader/Day-1.csv, bank_code/trader/Day-2.csv, ... Each CSV file contains a list of all the trades made by that individual each day. The problem is that the bank and trader should be part of the feature set. I.e., we need the RDD to look like: (bank, trader, day, list-of-trades) Anyone got any elegant solutions for doing this? Cheers, - SteveN
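A minimal sketch of that list-and-union approach; the layout and paths are assumptions based on the example above:

    val files = Seq(
      "hdfs:///trades/bank1/traderA/Day-1.csv",
      "hdfs:///trades/bank1/traderA/Day-2.csv")
    val perFile = files.map { path =>
      // recover the features encoded in the path itself
      val parts = path.stripPrefix("hdfs:///").split("/")
      val (bank, trader, day) = (parts(1), parts(2), parts(3).stripSuffix(".csv"))
      sc.textFile(path).map(line => (bank, trader, day, line))
    }
    val trades = sc.union(perFile)   // one RDD of (bank, trader, day, trade-line)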
Re: No Output
You can try increasing the parallelism. Can you be more specific about the task that you are doing? Maybe pasting the piece of code would help. On 18 Jan 2015 13:22, Deep Pradhan pradhandeep1...@gmail.com wrote: The error in the log file says: *java.lang.OutOfMemoryError: GC overhead limit exceeded* with a certain task ID, and the error repeats for further task IDs. What could be the problem? On Sun, Jan 18, 2015 at 2:45 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Does updating the Spark version mean setting up the entire cluster once more, or can we update it in some other way? On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste the code? Also you can try updating your Spark version. Thanks Best Regards On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark-1.0.0 on a single-node cluster. When I run a job with a small data set it runs perfectly, but when I use a data set of 350 KB, no output is produced, and when I try to run it a second time it gives me an exception saying that the SparkContext was shut down. Can anyone help me with this? Thank You
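As a minimal illustration of Akhil's parallelism suggestion (the numbers here are arbitrary):

    val conf = new SparkConf().set("spark.default.parallelism", "16")
    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs:///input", 16)   // or repartition(16) on an existing RDD

Smaller partitions mean smaller per-task working sets, which can relieve GC overhead errors like the one above.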
Re: ALS.trainImplicit running out of mem when using higher rank
OK. Are you sure the executor has the memory you think? -Xmx24g in its command line? It may be that for some reason your job is reserving an exceptionally large amount of non-heap memory. I am not sure that's to be expected with the ALS job though. Even if the settings work, consider using the explicit command line configuration. On Sat, Jan 17, 2015 at 12:49 PM, Antony Mayi antonym...@yahoo.com wrote: the values are for sure applied as expected - confirmed using the Spark UI environment page... it comes from my defaults configured using 'spark.yarn.executor.memoryOverhead=8192' (yes, now increased even more) in /etc/spark/conf/spark-defaults.conf and 'export SPARK_EXECUTOR_MEMORY=24G' in /etc/spark/conf/spark-env.sh
Re: Maven out of memory error
Oh: are you running the tests with a different profile setting than what the last assembly was built with? this particular test depends on those matching. Not 100% sure that's the problem, but a good guess. On Sat, Jan 17, 2015 at 4:54 PM, Ted Yu yuzhih...@gmail.com wrote: The test passed here: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/1215/consoleFull It passed locally with the following command: mvn -DHADOOP_PROFILE=hadoop-2.4 -Phadoop-2.4 -Pyarn -Phive test -Dtest=JavaAPISuite FYI On Sat, Jan 17, 2015 at 8:23 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Failing for me and another team member on the command line, for what it's worth. On Jan 17, 2015, at 2:39 AM, Sean Owen so...@cloudera.com wrote: Hm, this test hangs for me in IntelliJ. It could be a real problem, and a combination of a) just recently actually enabling Java tests, b) recent updates to the complicated Guava shading situation. The manifestation of the error usually suggests that something totally failed to start (because of, say, class incompatibility errors, etc.) Thus things hang and time out waiting for the dead component. It's sometimes hard to get answers from the embedded component that dies though. That said, it seems to pass on the command line. For example my recent Jenkins job shows it passes: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25682/consoleFull I'll try to uncover more later this weekend. Thoughts welcome though. On Fri, Jan 16, 2015 at 8:26 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Thanks Ted, got farther along but now have a failing test; is this a known issue? --- T E S T S --- Running org.apache.spark.JavaAPISuite Tests run: 72, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 123.462 sec FAILURE! - in org.apache.spark.JavaAPISuite testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 106.5 sec ERROR! 
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Running org.apache.spark.JavaJdbcRDDSuite Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.846 sec - in org.apache.spark.JavaJdbcRDDSuite Results: Tests in error: JavaAPISuite.testGuavaOptional » Spark Job aborted due to stage failure: Maste... On Fri, Jan 16, 2015 at 12:06 PM, Ted Yu yuzhih...@gmail.com wrote: Can you try doing this before running mvn? export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" What OS are you using? Cheers On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Just got the latest from Github and tried running `mvn test`; is this error common and do you have any advice on fixing it?
Re: maven doesn't build dependencies with Scala 2.11
I could be wrong, but I thought this was on purpose. At the time it was set up, there was no 2.11 Kafka available? Or one of its dependencies wouldn't work with 2.11? But I'm not sure what the OP means by "maven doesn't build Spark's dependencies", because Ted indicates it does, and of course you can see that these artifacts are published. On Sun, Jan 18, 2015 at 2:46 AM, Ted Yu yuzhih...@gmail.com wrote: There are 3 jars under the lib_managed/jars directory both with and without the -Dscala-2.11 flag. The difference between the scala-2.10 and scala-2.11 profiles is that the scala-2.10 profile has the following: <modules> <module>external/kafka</module> </modules> FYI On Sat, Jan 17, 2015 at 4:07 PM, Ted Yu yuzhih...@gmail.com wrote: I did the following: dev/change-version-to-2.11.sh mvn -DHADOOP_PROFILE=hadoop-2.4 -Pyarn,hive -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package And the mvn command passed. Did you see any cross-compilation errors? Cheers BTW the two links you mentioned are consistent in terms of building for Scala 2.11 On Sat, Jan 17, 2015 at 3:43 PM, Walrus theCat walrusthe...@gmail.com wrote: Hi, When I run this: dev/change-version-to-2.11.sh mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package as per here, maven doesn't build Spark's dependencies. Only when I run: dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package as gathered from here, do I get Spark's dependencies built without any cross-compilation errors. Questions: - How can I make maven do this? - How can I specify the use of Scala 2.11 in my own .pom files? Thanks
Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script
Nathan, I posted a bunch of questions for you as a comment on your question http://stackoverflow.com/q/28002443/877069 on Stack Overflow. If you answer them (don't forget to @ping me) I may be able to help you. Nick On Sat Jan 17 2015 at 3:49:54 PM gen tang gen.tan...@gmail.com wrote: Hi, This is because 'ssh-ready' in the ec2 script means that all the instances are in the 'running' state and all the instances pass their status checks (status 'OK'). In other words, the instances are ready to download and install software, just as EMR is ready for bootstrap actions. Before, the script just repeatedly printed information showing that it was waiting for every instance to launch, which was quite ugly, so they changed what gets printed. However, you can use ssh to connect to an instance even while it is in the 'pending' status. If you wait patiently a little longer, the script will finish launching the cluster. Cheers Gen On Sat, Jan 17, 2015 at 7:00 PM, Nathan Murthy nathan.mur...@gmail.com wrote: Originally posted here: http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script I'm trying to launch a standalone Spark cluster using its pre-packaged EC2 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

    ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k key-pair -i identity-file.pem -r us-west-2 -s 3 launch test
    Setting up security groups...
    Searching for existing cluster test...
    Spark AMI: ami-ae6e0d9e
    Launching instances...
    Launched 3 slaves in us-west-2c, regid = r-b___6
    Launched master in us-west-2c, regid = r-0__0
    Waiting for all instances in cluster to enter 'ssh-ready' state..

Yet I can SSH into these instances without complaint:

    ubuntu@machine:~$ ssh -i identity-file.pem root@master-ip
    Last login: Day MMM DD HH:mm:ss 20YY from c-AA-BBB--DDD.eee1.ff.provider.net

           __|  __|_  )
           _|  (     /   Amazon Linux AMI
          ___|\___|___|

    https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
    There are 59 security update(s) out of 257 total update(s) available
    Run "sudo yum update" to apply all updates.
    Amazon Linux version 2014.09 is available.
    [root@ip-internal ~]$

I'm trying to figure out if this is a problem in AWS or with the Spark scripts. I'd never had this issue until recently. -- Nathan Murthy // 713.884.7110 (mobile) // @natemurthy
Re: Spark Streaming with Kafka
I have the same issue. - Original message - From: Rasika Pohankar rasikapohan...@gmail.com Sent: 18/01/2015 18:48 To: user@spark.apache.org Subject: Spark Streaming with Kafka I am using Spark Streaming to process data received through Kafka. The Spark version is 1.2.0. I have written the code in Java and am compiling it using sbt. The program runs and receives data from Kafka and processes it as well. But it suddenly stops receiving data after some time (it has run for up to an hour while receiving data from Kafka, and then always stopped receiving). The program continues to run; it only stops receiving data. After a while, sometimes it starts again and sometimes it doesn't, so I stop the program and start it again. Earlier I was using Spark 1.0.0 and upgraded to check if the problem was in that version, but it still happens after upgrading. Is this a known issue? Can someone please help. Thanking you. View this message in context: Spark Streaming with Kafka Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Privacy notice: http://www.unibs.it/node/8155
Re: Trying to find where Spark persists RDDs when run with YARN
These will be under the working directory of the YARN container running the executor. I don't have it handy, but I think it will also be a spark-local or similar directory. On Sun, Jan 18, 2015 at 2:50 PM, Hemanth Yamijala yhema...@gmail.com wrote: Hi, I am trying to find where Spark persists RDDs when we call the persist() API and run under YARN. This is purely for understanding... In my driver program, I wait indefinitely, so as to avoid any cleanup problems. In the actual job, I roughly do the following: JavaRDD<String> lines = context.textFile(args[0]); lines.persist(StorageLevel.DISK_ONLY()); lines.collect(); When run with a local executor, I can see that the files (like rdd_1_0) are persisted under directories like /var/folders/mt/51srrjc15wl3n829qkgnh2dmgp/T/spark-local-20150118201458-6147/15. Where, similarly, can I find these under YARN? Thanks hemanth - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
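As a rough sketch of where to look on a worker node, assuming YARN's default yarn.nodemanager.local-dirs of /tmp/hadoop-yarn/nm-local-dir (check yarn-site.xml for the actual value on your cluster):

    # run on a NodeManager host; the path is an assumption based on the
    # default yarn.nodemanager.local-dirs setting
    find /tmp/hadoop-yarn/nm-local-dir/usercache -name 'rdd_*' 2>/dev/null

The persisted block files sit inside the per-application cache directories that YARN creates for each executor container, so they disappear when the application's containers are cleaned up.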
Re: Row similarities
Right, done with matrix blocks. Seems like a lot of duplicate effort, but that's the way of OSS sometimes. I didn't see transpose in the Jira. Are there plans for transpose and rowSimilarity without transpose? The latter seems easier than columnSimilarity in the general/naive case. Thresholds could also be used there. A threshold for downsampling is going to be extremely hard to use in practice, but I'm not sure what the threshold applies to, so I must have skimmed too fast. If the threshold were some number of sigmas it would be more usable and could extend to non-cosine similarities. It would add a non-trivial step to the calc, of course. But without that I've never seen a threshold actually used effectively, and I've tried; it's so dependent on the dataset. Observation as a question: One of the reasons I use the Mahout Spark R-like DSL is for row and column similarity (used in cooccurrence recommenders), and you can count on a robust full linear algebra implementation. Once you have full linear algebra on top of an optimizer, many of the factorization methods are dead simple. Seems like MLlib has chosen to do optimized higher-level algorithms first instead of full linear algebra first? On Jan 17, 2015, at 6:27 PM, Reza Zadeh r...@databricks.com wrote: We're focused on providing block matrices, which makes transposition simple: https://issues.apache.org/jira/browse/SPARK-3434 On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel p...@occamsmachete.com wrote: In the Mahout Spark R-like DSL [A'A] and [AA'] don't actually do a transpose; it's optimized out. Mahout has had a standalone row matrix transpose since day 1 and supports it in the Spark version. You can't really do matrix algebra without it, even though it's often possible to optimize it away. Row similarity with LLR is much simpler than cosine, since you only need non-zero sums for column, row, and matrix elements, so rowSimilarity is implemented in Mahout for Spark. Full-blown row similarity including all the different similarity methods (long since implemented in hadoop mapreduce) hasn't been moved to Spark yet. Yep, rows are not covered in the blog, my mistake. Too bad; it has a lot of uses and can at the very least be optimized for output matrix symmetry. On Jan 17, 2015, at 11:44 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Yeah okay, thanks. On Jan 17, 2015, at 11:15 AM, Reza Zadeh r...@databricks.com wrote: Pat, columnSimilarities is what that blog post is about, and is already part of Spark 1.2. rowSimilarities in a RowMatrix is a little more tricky because you can't transpose a RowMatrix easily, and is being tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 Andrew, sometimes (not always) it's OK to transpose a RowMatrix; if, for example, the number of rows in your RowMatrix is less than 1m, you can transpose it and use rowSimilarities. On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel p...@occamsmachete.com wrote: BTW it looks like row and column similarities (cosine based) are coming to MLlib through DIMSUM. Andrew said rowSimilarity doesn't seem to be in the master yet. Does anyone know the status?
See: https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html Also, the method for computation reduction (making it less than O(n^2)) seems rooted in cosine. A different computation-reduction method is used in the Mahout code, tied to LLR. Seems like we should get these together. On Jan 17, 2015, at 9:37 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Excellent, thanks Pat. On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com wrote: Mahout's Spark implementation of rowsimilarity is in the Scala SimilarityAnalysis class. It actually does either row or column similarity but only supports LLR at present. It does [AA'] for columns or [A'A] for rows first, then calculates the distance (LLR) for non-zero elements. This is a major optimization for sparse matrices. As I recall, the old hadoop code only did this for half the matrix since it's symmetric, but that optimization isn't in the current code because the downsampling is done as LLR is calculated, so the entire similarity matrix is never actually calculated unless you disable downsampling. The primary use is for recommenders, but I've used it (in the test suite) for row-wise text token similarity too. On Jan 17, 2015, at 9:00 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Yeah that's the kind of thing I'm looking
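For reference, a minimal Scala sketch of the columnSimilarities API discussed in this thread, with made-up data; the threshold variant is the DIMSUM downsampling knob described in the blog post above:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val sc = new SparkContext(new SparkConf().setAppName("column-similarity-sketch"))
    // toy 2x3 matrix: each Vector is one row
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 4.0)))
    val mat = new RowMatrix(rows)
    val exact = mat.columnSimilarities()      // brute-force all-pairs cosine similarity
    println(exact.entries.count())            // number of non-zero similarity entries
    val approx = mat.columnSimilarities(0.5)  // DIMSUM: threshold trades accuracy for speed
    approx.entries.collect().foreach(println) // CoordinateMatrix entries (i, j, similarity)

As the thread notes, there is no rowSimilarities counterpart on RowMatrix; for small row counts one can transpose and reuse this method, which is what SPARK-4823 tracks generalizing.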
Spark Streaming with Kafka
I am using Spark Streaming to process data received through Kafka. The Spark version is 1.2.0. I have written the code in Java and am compiling it using sbt. The program runs and receives data from Kafka and processes it as well. But it suddenly stops receiving data after some time (it has run for up to an hour while receiving data from Kafka, and then always stopped receiving). The program continues to run; it only stops receiving data. After a while, sometimes it starts again and sometimes it doesn't, so I stop the program and start it again. Earlier I was using Spark 1.0.0 and upgraded to check if the problem was in that version, but it still happens after upgrading. Is this a known issue? Can someone please help. Thanking you. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-tp21222.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
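For context, a minimal Scala sketch of the receiver-based Kafka API presumably in play here (the original code is Java, but the shape is the same; the ZooKeeper quorum, group id, and topic name are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-sketch"), Seconds(10))
    // placeholders: ZooKeeper quorum, consumer group, topic -> receiver thread count
    val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
    // counting records per batch makes a stalled receiver visible (counts drop to 0)
    stream.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()

Printing per-batch counts like this, alongside the Streaming tab of the web UI, helps distinguish a dead receiver from an empty topic when the symptom described above appears.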
Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.
I posted about the Application WebUI error (specifically the application WebUI, not the master WebUI generally) and have spent at least a few hours a day for over a week trying to resolve it, so I'd be very grateful for any suggestions. It is quite troubling that I appear to be the only one encountering this issue, and I've tried to include everything here which might be relevant (sorry for the length). Please see the thread "Current Build Gives HTTP ERROR" (https://www.mail-archive.com/user@spark.apache.org/msg18752.html) for specifics about the application WebUI issue and the master log.

Environment: I'm doing my spark builds and application programming in Scala locally on my MacBook Pro in Eclipse, using modified ec2 launch scripts to launch my cluster, uploading my spark builds and models to s3, and uploading applications to and submitting them from ec2. I'm using Java 8 locally and also installing and using Java 8 on my ec2 instances (which works with spark 1.2.0). I have a Windows machine at home (the MacBook is my work machine), but have not yet attempted to launch from there.

Errors: I've built two different recent git versions of spark, both multiple times, and when running applications both have produced an Application WebUI error and this exception: Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. While both will display the master WebUI just fine, including running/completed applications, registered workers, etc., when I try to access a running or completed application's WebUI by clicking its respective link, I receive a server error. When I manually create the above log directory, the exception goes away, but the WebUI problem does not. I don't have any strong evidence, but I suspect these errors and whatever is causing them are related.

Why and How of Modifications to Launch Scripts for Installation of Unreleased Spark Versions: When using a prebuilt version of spark on my cluster, everything works except the new methods I need, which I had previously added to my custom version of spark and used by building the spark-assembly.jar locally and then replacing the assembly file produced through the 1.1.0 ec2 launch scripts. However, since my pull request was accepted and can now be found in the apache/spark repository along with some additional features I'd like to use, and because I'd like a more elegant permanent solution for launching a cluster and installing unreleased versions of spark on my ec2 clusters, I've modified the included ec2 launch scripts in this way (credit to gen tang here: https://www.mail-archive.com/user@spark.apache.org/msg18761.html):
1. Clone the most recent git version of spark
2. Use the make-dist script
3. Tar the dist folder, upload the resulting spark-1.3.0-snapshot-hadoop1.tgz to s3, and change file permissions
4. Fork the mesos/spark-ec2 repository and modify the spark/init.sh script to do a wget of my hosted distribution instead of spark's stable release
5. Modify my spark_ec2.py script to point to my repository
6. Modify my spark_ec2.py script to install Java 8 on my ec2 instances (this works and does not produce the above stated errors when using a stable release like 1.2.0)
Additional Possibly Related Info: As far as I can tell (I went through line by line), when I launch my recent build vs. the most recent stable release, the console prints almost identical INFO and WARNING messages, except where you would expect things to differ, e.g. version numbers. I've noted that after launch the prebuilt stable version does not have a /tmp/spark-events directory, but it is created when the application is launched, while it is never created in my build. Further, in my unreleased builds the application logs that I find are always stored as .inprogress files (when I set the logging directory to /root/ or add the /tmp/spark-events directory manually) even after completion; I believe the suffix is supposed to change to .completed (or something similar) when the application finishes. Thanks for any help!
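One thing worth checking, sketched here under the assumption that event logging is what triggers the /tmp/spark-events error: the UI for completed applications is reconstructed from event logs, so the directory named by spark.eventLog.dir must exist and be writable before the application starts, or logs can be left as .inprogress and the application UI can fail to render. In conf/spark-defaults.conf:

    spark.eventLog.enabled   true
    # must exist before apps start; /tmp/spark-events is the default,
    # and an HDFS path also works on a cluster
    spark.eventLog.dir       /tmp/spark-events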
Re: Maven out of memory error
Yes. That could be the cause. On Sun, Jan 18, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote: Oh: are you running the tests with a different profile setting than what the last assembly was built with? This particular test depends on those matching. Not 100% sure that's the problem, but a good guess. On Sat, Jan 17, 2015 at 4:54 PM, Ted Yu yuzhih...@gmail.com wrote: The test passed here: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/1215/consoleFull It passed locally with the following command: mvn -DHADOOP_PROFILE=hadoop-2.4 -Phadoop-2.4 -Pyarn -Phive test -Dtest=JavaAPISuite FYI On Sat, Jan 17, 2015 at 8:23 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Failing for me and another team member on the command line, for what it's worth. On Jan 17, 2015, at 2:39 AM, Sean Owen so...@cloudera.com wrote: Hm, this test hangs for me in IntelliJ. It could be a real problem, and a combination of a) just recently actually enabling Java tests, and b) recent updates to the complicated Guava shading situation. The manifestation of the error usually suggests that something totally failed to start (because of, say, class incompatibility errors, etc.). Thus things hang and time out waiting for the dead component. It's sometimes hard to get answers from the embedded component that dies, though. That said, it seems to pass on the command line. For example, my recent Jenkins job shows it passes: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25682/consoleFull I'll try to uncover more later this weekend. Thoughts welcome though. On Fri, Jan 16, 2015 at 8:26 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Thanks Ted, got farther along but now have a failing test; is this a known issue?
---
 T E S T S
---
Running org.apache.spark.JavaAPISuite
Tests run: 72, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 123.462 sec FAILURE! - in org.apache.spark.JavaAPISuite
testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 106.5 sec ERROR!
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399)
  at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
  at akka.actor.ActorCell.invoke(ActorCell.scala:487)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
  at akka.dispatch.Mailbox.run(Mailbox.scala:220)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Running org.apache.spark.JavaJdbcRDDSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.846 sec - in org.apache.spark.JavaJdbcRDDSuite
Results :
Tests in error: JavaAPISuite.testGuavaOptional » Spark Job aborted due to stage failure: Maste...
On Fri, Jan 16, 2015 at 12:06 PM, Ted Yu yuzhih...@gmail.com wrote: Can you try doing this before running mvn? export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" What OS