Re: Enormous EC2 price jump makes r3.large patch more important
Ah, right. So only the launch script has changed. Everything else is still essentially binary compatible? Well, that makes it too easy! Thanks! On Wed, Jun 18, 2014 at 2:35 PM, Patrick Wendell pwend...@gmail.com wrote: Actually you'll just want to clone the 1.0 branch then use the spark-ec2 script in there to launch your cluster. The --spark-git-repo flag is if you want to launch with a different version of Spark on the cluster. In your case you just need a different version of the launch script itself, which will be present in the 1.0 branch of Spark. - Patrick On Tue, Jun 17, 2014 at 9:29 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: I am about to spin up some new clusters, so I may give that a go... any special instructions for making them work? I assume I use the --spark-git-repo= option on the spark-ec2 command. Is it as easy as concatenating your string as the value? On cluster management GUIs... I've been looking around at Ambari, DataStax, Cloudera, OpsCenter etc. Not totally convinced by any of them yet. Anyone using a good one I should know about? I'm really beginning to lean in the direction of Cassandra as the distributed data store... On Wed, Jun 18, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com wrote: By the way, in case it's not clear, I mean our maintenance branches: https://github.com/apache/spark/tree/branch-1.0 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1 release soon (this patch being one of the main reasons), but if you are itching for this sooner, you can just check out the head of branch-1.0 and you will be able to use r3.XXX instances. - Patrick On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Some people (me included) might have wondered why all our m1.large spot instances (in us-west-1) shut down a few hours ago... Simple reason: the EC2 spot price for Spark's default m1.large instances just jumped from $0.016 per hour to about $0.750. Yes, fifty times. Probably something to do with the World Cup. So far this is just us-west-1, but prices have a tendency to equalize across centers as the days pass. Time to make backups and plans. m3 spot prices are still down at $0.02 (and, being new, will be bypassed by older systems), so it would be REAAALLYY nice if there had been some progress on that issue. Let me know if I can help with testing and whatnot. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Enormous EC2 price jump makes r3.large patch more important
Some people (me included) might have wondered why all our m1.large spot instances (in us-west-1) shut down a few hours ago... Simple reason: the EC2 spot price for Spark's default m1.large instances just jumped from $0.016 per hour to about $0.750. Yes, fifty times. Probably something to do with the World Cup. So far this is just us-west-1, but prices have a tendency to equalize across centers as the days pass. Time to make backups and plans. m3 spot prices are still down at $0.02 (and, being new, will be bypassed by older systems), so it would be REAAALLYY nice if there had been some progress on that issue. Let me know if I can help with testing and whatnot. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Enormous EC2 price jump makes r3.large patch more important
I am about to spin up some new clusters, so I may give that a go... any special instructions for making them work? I assume I use the --spark-git-repo= option on the spark-ec2 command. Is it as easy as concatenating your string as the value? On cluster management GUIs... I've been looking around at Ambari, DataStax, Cloudera, OpsCenter etc. Not totally convinced by any of them yet. Anyone using a good one I should know about? I'm really beginning to lean in the direction of Cassandra as the distributed data store... On Wed, Jun 18, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com wrote: By the way, in case it's not clear, I mean our maintenance branches: https://github.com/apache/spark/tree/branch-1.0 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1 release soon (this patch being one of the main reasons), but if you are itching for this sooner, you can just check out the head of branch-1.0 and you will be able to use r3.XXX instances. - Patrick On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Some people (me included) might have wondered why all our m1.large spot instances (in us-west-1) shut down a few hours ago... Simple reason: the EC2 spot price for Spark's default m1.large instances just jumped from $0.016 per hour to about $0.750. Yes, fifty times. Probably something to do with the World Cup. So far this is just us-west-1, but prices have a tendency to equalize across centers as the days pass. Time to make backups and plans. m3 spot prices are still down at $0.02 (and, being new, will be bypassed by older systems), so it would be REAAALLYY nice if there had been some progress on that issue. Let me know if I can help with testing and whatnot. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Best practise for 'Streaming' dumps?
I read it more carefully, and window() might actually work for some other stuff like logs. (assuming I can have multiple windows with entirely different attributes on a single stream..) Thanks for that! On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Yes.. but from what I understand that's a sliding window, so for a window of (60) over (1) second DStreams, that would save the entire last minute of data once per second. That's more than I need. I think what I'm after is probably updateStateByKey... I want to mutate data structures (probably even graphs) as the stream comes in, but I also want that state to be persistent across restarts of the application (or a parallel version of the app, if possible). So I'd have to save that structure occasionally and reload it as the primer on the next run. I was almost going to use HBase or Hive, but they seem to have been deprecated in 1.0.0? Or just late to the party? Also, I've been having trouble deleting Hadoop directories.. the old two-line examples don't seem to work anymore. I actually managed to fill up the worker instances (I gave them tiny EBS) and I think I crashed them. On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo lbust...@gmail.com wrote: Have you thought of using window? Gino B. On Jun 6, 2014, at 11:49 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: It's going well enough that this is a "how should I" in 1.0.0 rather than a "how do I" question. So I've got data coming in via Streaming (twitters) and I want to archive/log it all. It seems a bit wasteful to generate a new HDFS file for each DStream, but I also want to guard against data loss from crashes. I suppose what I want is to let things build up into superbatches over a few minutes, and then serialize those to Parquet files, or similar? Or do I? Do I count down the number of DStreams, or does Spark have a preferred way of scheduling cron events? What's the best practise for keeping persistent data for a streaming app? (Across restarts) And to clean up on termination? -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
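For what it's worth, a minimal sketch of the updateStateByKey pattern discussed above, assuming a keyed DStream already exists; the checkpoint path, the "counts" name and the Long state type are made up for illustration, and the pair-DStream implicits (import org.apache.spark.streaming.StreamingContext._) are assumed to be in scope:

// counts: DStream[(String, Long)] built earlier from the tweet stream
ssc.checkpoint("hdfs:///checkpoints/my-streaming-app")   // hypothetical path

val running = counts.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
  // fold the new batch into whatever state we had before
  Some(state.getOrElse(0L) + newValues.sum)
}

The state only survives a restart if the application is brought back up from that checkpoint directory, so it is a partial answer to the "persistent across restarts" question rather than a complete one.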
Are scala.MatchError messages a problem?
I shut down my first (working) cluster and brought up a fresh one... and it's been a bit of a horror and I need to sleep now. Should I be worried about these errors? Or did I just have the old log4j.config tuned so I didn't see them? 14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming job 1402245172000 ms.2 scala.MatchError: 0101-01-10 (of class java.lang.String) at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218) at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217) at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) The error comes from this code, which seemed like a sensible way to match things (the case cmd_plus(w) statement is generating the error):

val cmd_plus = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r

// find command user tweets
val commands = stream.map( status => ( status.getUser().getId(), status.getText() ) )
  .foreachRDD(rdd => {
    rdd.join(superusers).map( x => x._2._1 ).collect().foreach{ cmd => {
      cmd match {                       // line 218
        case cmd_plus(w) => { ... }
        case cmd_minus(w) => { ... }
      }
    }}
  })

It seems a bit excessive for Scala to throw exceptions because a regex didn't match. Something feels wrong.
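A minimal sketch of one way to make that match exhaustive; the regexes are copied from the snippet above, and the handler bodies are placeholders:

val cmd_plus  = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r

def handleCommand(cmd: String): Unit = cmd match {
  case cmd_plus(w)  => println(s"add $w")     // placeholder action
  case cmd_minus(w) => println(s"remove $w")  // placeholder action
  case _            => ()                     // anything that isn't a command falls through quietly
}

Without that last wildcard case, any string the regexes don't match reaches the end of the match and Scala has nothing left to do but throw scala.MatchError at runtime.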
Re: Are scala.MatchError messages a problem?
On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath nick.pentre...@gmail.com wrote: When you use match, the match must be exhaustive. That is, a match error is thrown if the match fails. Ahh, right. That makes sense. Scala is applying its strong typing rules here instead of "no ceremony"... but isn't the idea that type errors should get picked up at compile time? I suppose the compiler can't tell there's not complete coverage, but it seems strange to throw that at runtime when it is literally the 'default case'. I think I need a good Scala Programming Guide... any suggestions? I've read and watched the usual resources and videos, but it feels like a shotgun approach and I've clearly missed a lot. On Mon, Jun 9, 2014 at 3:26 AM, Mark Hamstra m...@clearstorydata.com wrote: And you probably want to push down that filter into the cluster -- collecting all of the elements of an RDD only to not use or filter out some of them isn't an efficient use of expensive (at least in terms of time/performance) network resources. There may also be a good opportunity to use the partial function form of collect to push even more processing into the cluster. I almost certainly do :-) And I am really looking forward to spending time optimizing the code, but I keep getting caught up on deployment issues, uberjars, missing /mnt/spark directories, only being able to submit from the master, and being thoroughly confused about sample code from three versions ago. I'm even thinking of learning Maven, if it means I never have to use sbt again. Does it mean that? -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
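A rough sketch of what Mark's suggestion might look like against the snippet from the earlier message; cmd_plus, cmd_minus, stream and superusers are the names from that code, while the ("add", ...) tuple labels are made up:

// (userId, text) pairs, as in the original snippet
val commands = stream.map(status => (status.getUser().getId(), status.getText()))

commands.foreachRDD { rdd =>
  val matched = rdd.join(superusers)        // keep only tweets from superusers
    .map(_._2._1)                           // just the tweet text
    .collect {                              // RDD.collect(PartialFunction) runs on the workers,
      case cmd_plus(w)  => ("add", w)       // so non-matching tweets never reach the driver
      case cmd_minus(w) => ("remove", w)
    }
  matched.collect().foreach { case (op, w) => println(s"$op $w") }  // act on the driver
}

Because the partial function is also a filter, the unmatched case simply disappears here instead of throwing a MatchError, and only the recognised commands are shipped back over the network.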
Re: New user streaming question
Yup, when it's running, DStream.print() will print out a timestamped block for every time step, even if the block is empty. (For v1.0.0, which I have running in the other window.) If you're not getting that, I'd guess the stream hasn't started up properly. On Sat, Jun 7, 2014 at 11:50 AM, Michael Campbell michael.campb...@gmail.com wrote: I've been playing with Spark and streaming and have a question on stream outputs. The symptom is I don't get any. I have run spark-shell and all does as I expect, but when I run the word-count example with streaming, it *works* in that things happen and there are no errors, but I never get any output. Am I understanding correctly how it is supposed to work? Is the DStream.print() method supposed to print the output for every (micro)batch of the streamed data? If that's the case, I'm not seeing it. I'm using the netcat example and the StreamingContext uses the network to read words, but as I said, nothing comes out. I tried changing the .print() to .saveAsTextFiles(), and I AM getting a file, but nothing is in it other than a _temporary subdir. I'm sure I'm confused here, but not sure where. Help? -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
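For reference, a bare-bones sketch of the netcat word count, mainly to show where print() sits relative to start(); the object name, host and port are arbitrary:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MiniWordCount {
  def main(args: Array[String]) {
    // a receiver occupies one core, so plain "local" (one core) would leave nothing to process batches
    val conf = new SparkConf().setMaster("local[2]").setAppName("MiniWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.print()        // one "Time: ..." block per batch, even when there is nothing to show

    ssc.start()           // nothing at all is printed until the context is started
    ssc.awaitTermination()
  }
}

With something feeding the socket (e.g. nc -lk 9999 in another terminal, as in the streaming guide's example), a block should appear every second; if no block ever appears, the context probably never started or the receiver has no core to run on.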
Best practise for 'Streaming' dumps?
It's going well enough that this is a "how should I" in 1.0.0 rather than a "how do I" question. So I've got data coming in via Streaming (twitters) and I want to archive/log it all. It seems a bit wasteful to generate a new HDFS file for each DStream, but I also want to guard against data loss from crashes. I suppose what I want is to let things build up into superbatches over a few minutes, and then serialize those to Parquet files, or similar? Or do I? Do I count down the number of DStreams, or does Spark have a preferred way of scheduling cron events? What's the best practise for keeping persistent data for a streaming app? (Across restarts) And to clean up on termination? -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
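A sketch of the "superbatch" idea using a non-overlapping window, assuming a DStream of already-formatted lines; the five-minute figure, the logLines name and the output path are placeholders:

import org.apache.spark.streaming.Minutes

// logLines: DStream[String], e.g. tweets already mapped to whatever text should be archived
logLines
  .window(Minutes(5), Minutes(5))                     // window length == slide, so windows don't overlap
  .saveAsTextFiles("hdfs:///archive/tweets", "log")   // one directory of part-files per five-minute window

The built-in output operations are text, object and (for pair streams) Hadoop files; writing Parquet instead would presumably mean dropping down to foreachRDD and whichever writer you prefer.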
Re: Can't seem to link external/twitter classes from my own app
I shan't be far. I'm committed now. Spark and I are going to have a very interesting future together, but hopefully future messages will be about the algorithms and modules, and less how do I run make?. I suspect doing this at the exact moment of the 0.9 - 1.0.0 transition hasn't helped me. (I literally had the documentation changing on me between page reloads last thursday, after days of studying the old version. I thought I was going crazy until the new version number appeared in the corner and the release email went out.) The last time I entered into a serious relationship with a piece of software like this was with a little company called Cognos. :-) And then Microsoft asked us for some advice about a thing called OLAP Server they were making. (But I don't think they listened as hard as they should have.) Oh, the things I'm going to do with Spark! If it hadn't existed, I would have had to make it. (My honors thesis was in distributed computing. I once created an incrementally compiled language that could pause execution, decompile, move to another machine, recompile, restore state and continue while preserving all active network connections. discuss.) On Thu, Jun 5, 2014 at 5:46 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Great - well we do hope we hear from you, since the user list is for interesting success stories and anecdotes, as well as blog posts etc too :) On Thu, Jun 5, 2014 at 9:40 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Oh. Yes of course. *facepalm* I'm sure I typed that at first, but at some point my fingers decided to grammar-check me. Stupid fingers. I wonder what sbt assemble does? (apart from error) It certainly takes a while to do it. Thanks for the maven offer, but I'm not scheduled to learn that until after Scala, streaming, graphx, mllib, HDFS, sbt, Python, and yarn. I'll probably need to know it for yarn, but I'm really hoping to put it off until then. (fortunately I already knew about linux, AWS, eclipse, git, java, distributed programming and ssh keyfiles, or I would have been in real trouble) Ha! OK, that worked for the Kafka project... fails on the other old 0.9 Twitter project, but who cares... now for mine HAHA! YES!! Oh thank you! I have the equivalent of hello world that uses one external library! Now the compiler and I can have a _proper_ conversation. Hopefully you won't be hearing from me for a while. On Thu, Jun 5, 2014 at 3:06 PM, Nick Pentreath nick.pentre...@gmail.com wrote: The magic incantation is sbt assembly (not assemble). Actually I find maven with their assembly plugins to be very easy (mvn package). I can send a Pom.xml for a skeleton project if you need — Sent from Mailbox https://www.dropbox.com/mailbox On Thu, Jun 5, 2014 at 6:59 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Hmm.. That's not working so well for me. First, I needed to add a project/plugin.sbt file with the contents: addSbtPlugin(com.eed3si9n % sbt-assembly % 0.11.4) Before 'sbt/sbt assemble' worked at all. And I'm not sure about that version number, but 0.9.1 isn't working much better and 11.4 is the latest one recommended by the sbt project site. Where did you get your version from? Second, even when I do get it to build a .jar, spark-submit is still telling me the external.twitter library is missing. I tried using your github project as-is, but it also complained about the missing plugin.. I'm trying it with various versions now to see if I can get that working, even though I don't know anything about kafka. Hmm, and no. 
Here's what I get: [info] Set current project to Simple Project (in build file:/home/ubuntu/spark-1.0.0/SparkKafka/) [error] Not a valid command: assemble [error] Not a valid project ID: assemble [error] Expected ':' (if selecting a configuration) [error] Not a valid key: assemble (similar: assembly, assemblyJarName, assemblyDirectory) [error] assemble [error] I also found this project which seemed to be exactly what I was after: https://github.com/prabeesh/SparkTwitterAnalysis ...but it was for Spark 0.9, and though I updated all the version references to 1.0.0, that one doesn't work either. I can't even get it to build. *sigh* Is it going to be easier to just copy the external/ source code into my own project? Because I will... especially if creating Uberjars takes this long every... single... time... On Thu, Jun 5, 2014 at 8:52 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Thanks Patrick! Uberjars. Cool. I'd actually heard of them. And thanks for the link to the example! I shall work through that today. I'm still learning sbt and it's many options... the last new framework I learned was node.js, and I think I've been rather spoiled by npm. At least it's not maven. Please, oh please don't make me learn maven too. (The only people who seem to like it have Software Stockholm Syndrome: I know maven kidnapped me
Twitter feed options?
Me again. Things have been going well, actually. I've got my build chain sorted, and 1.0.0 and streaming are working reliably. I managed to turn off the INFO messages by messing with every log4j properties file on the system. :-) One thing I would like to try now is some natural language processing on some selected Twitter streams (i.e. my own), but the streaming example seems to be 'sipping from the firehose'. I'm combing through the twitter4j documentation now, but does anyone know a simple way of restricting the 'flood' to just my own timeline? Otherwise, yes, this is now the fun part! -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
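For context, the stock receiver only takes keyword filters (Twitter's statuses/filter "track" parameter), so the closest built-in approximation is filtering on your own handle. A sketch, where "@my_handle" is a placeholder and OAuth credentials are assumed to arrive via the twitter4j.oauth.* system properties:

import org.apache.spark.streaming.twitter.TwitterUtils

val filters  = Seq("@my_handle")                        // hypothetical handle
val mentions = TwitterUtils.createStream(ssc, None, filters)

That catches mentions of the handle rather than the timeline itself, which is presumably where the custom receiver described in the follow-up message comes in.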
Re: Twitter feed options?
Nope, sorry, never mind! I looked at the source, and it was pretty obvious that it didn't implement that yet, so I've ripped the classes out and am mutating them into new receivers right now... ... starting to get the hang of this. On Fri, Jun 6, 2014 at 1:07 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Me again. Things have been going well, actually. I've got my build chain sorted, and 1.0.0 and streaming are working reliably. I managed to turn off the INFO messages by messing with every log4j properties file on the system. :-) One thing I would like to try now is some natural language processing on some selected Twitter streams (i.e. my own), but the streaming example seems to be 'sipping from the firehose'. I'm combing through the twitter4j documentation now, but does anyone know a simple way of restricting the 'flood' to just my own timeline? Otherwise, yes, this is now the fun part! -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
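In case it helps anyone following along, a rough sketch of the kind of receiver that ripping out those classes might lead to, reading the authenticated user's own timeline through the twitter4j user stream. The class name is invented, and twitter4j is assumed to pick up its OAuth credentials from twitter4j.properties or -Dtwitter4j.oauth.* system properties:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import twitter4j._

class UserTimelineReceiver
  extends Receiver[Status](StorageLevel.MEMORY_AND_DISK_SER) {

  @volatile private var twitterStream: TwitterStream = _

  def onStart() {
    val stream = new TwitterStreamFactory().getInstance()
    val listener: UserStreamListener = new UserStreamAdapter {
      override def onStatus(status: Status) { store(status) }               // hand each tweet to Spark
      override def onException(e: Exception) { restart("Twitter stream error", e) }
    }
    stream.addListener(listener)
    stream.user()              // user stream: only tweets from/to the authenticated account
    twitterStream = stream     // user() spawns its own thread, so onStart() does not block
  }

  def onStop() {
    if (twitterStream != null) twitterStream.shutdown()
  }
}

// usage: val myTimeline = ssc.receiverStream(new UserTimelineReceiver)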
Re: Yay for 1.0.0! EC2 Still has problems.
On Wed, Jun 4, 2014 at 12:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Ah, sorry to hear you had more problems. Some thoughts on them: There will always be more problems, 'tis the nature of coding. :-) I try not to bother the list until I've smacked my head against them for a few hours, so it's only the most confusing stuff I pour out here. I'm actually progressing pretty well. (the streaming.Twitter ones especially) depend on there being a /mnt/spark and /mnt2/spark directory (I think for java tempfiles?) and those don't seem to exist out-of-the-box. I think this is a side-effect of the r3 instances not having those drives mounted. Our setup script would normally create these directories. What was the error? Oh, I went back to m1.large while those issues get sorted out. I decided I had enough problems without messing with that too. (seriously, why does Amazon do these things? It's like they _try_ to make the instances incompatible.) I forget the exact error, but it traced through createTempFile and it was fairly clear about the directory being missing. Things like bin/run-example SparkPi worked fine, but I'll bet twitter4j creates temp files, so bin/run-example streaming.TwitterPopularTags broke. What did you change log4j.properties to? It should be changed to say log4j.rootCategory=WARN, console but maybe another log4j.properties is somehow arriving on the classpath. This is definitely a common problem so we need to add some explicit docs on it. I seem to have this sorted out, don't ask me how. Once again I was probably editing things on the cluster master when I should have been editing the cluster controller, or vice versa. But, yeah, many of the examples just get lost in a sea of DAG INFO messages. Are you going through http://spark.apache.org/docs/latest/quick-start.html? You should be able to do just sbt package. Once you do that you don’t need to deploy your application’s JAR to the cluster, just pass it to spark-submit and it will automatically be sent over. Ah, that answers another question I just asked elsewhere... Yup, I re-read pretty much every documentation page daily. And I'm making my way through every video. Meanwhile I'm learning scala... Great Turing's Ghost, it's the dream language we've theorized about for years! I hadn't realized! Indeed, glad you’re enjoying it. Enjoying, not yet alas, I'm sure I'll get there. But I do understand the implications of a mixed functional-imperative language with closures and lambdas. That is serious voodoo. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Can't seem to link external/twitter classes from my own app
Thanks Patrick! Uberjars. Cool. I'd actually heard of them. And thanks for the link to the example! I shall work through that today. I'm still learning sbt and it's many options... the last new framework I learned was node.js, and I think I've been rather spoiled by npm. At least it's not maven. Please, oh please don't make me learn maven too. (The only people who seem to like it have Software Stockholm Syndrome: I know maven kidnapped me and beat me up, but if you spend long enough with it, you eventually start to sympathize and see it's point of view.) On Thu, Jun 5, 2014 at 3:39 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, The issue is that you are using one of the external libraries and these aren't actually packaged with Spark on the cluster, so you need to create an uber jar that includes them. You can look at the example here (I recently did this for a kafka project and the idea is the same): https://github.com/pwendell/kafka-spark-example You'll want to make an uber jar that includes these packages (run sbt assembly) and then submit that jar to spark-submit. Also, I'd try running it locally first (if you aren't already) just to make the debugging simpler. - Patrick On Wed, Jun 4, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote: Ah sorry, this may be the thing I learned for the day. The issue is that classes from that particular artifact are missing though. Worth interrogating the resulting .jar file with jar tf to see if it made it in? On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath nick.pentre...@gmail.com wrote: @Sean, the %% syntax in SBT should automatically add the Scala major version qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct syntax for the build. I seemed to run into this issue with some missing Jackson deps, and solved it by including the jar explicitly on the driver class path: bin/spark-submit --driver-class-path SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar Seems redundant to me since I thought that the JAR as argument is copied to driver and made available. But this solved it for me so perhaps give it a try? On Wed, Jun 4, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: Those aren't the names of the artifacts: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22 The name is spark-streaming-twitter_2.10 On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Man, this has been hard going. Six days, and I finally got a Hello World App working that I wrote myself. Now I'm trying to make a minimal streaming app based on the twitter examples, (running standalone right now while learning) and when running it like this: bin/spark-submit --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar I'm getting this error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ Which I'm guessing is because I haven't put in a dependency to external/twitter in the .sbt, but _how_? I can't find any docs on it. 
Here's my build file so far: simple.sbt --

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0"

libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"

libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

-- I've tried a few obvious things like adding: libraryDependencies += "org.apache.spark" %% "spark-external" % "1.0.0" and libraryDependencies += "org.apache.spark" %% "spark-external-twitter" % "1.0.0" because, well, that would match the naming scheme implied so far, but it errors. Also, I just realized I don't completely understand if: (a) the spark-submit command _sends_ the .jar to all the workers, or (b) the spark-submit command sends a _job_ to the workers, which are supposed to already have the jar file installed (or in hdfs), or (c) the Context is supposed to list the jars to be distributed. (is that deprecated?) One part of the documentation says: "Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar." but another says: "application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes." I suppose both could
Re: Why Scala?
safely ignored. On Thu, May 29, 2014 at 1:55 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I recently discovered Hacker News and started reading through older posts about Scala https://hn.algolia.com/?q=scala#!/story/forever/0/scala. It looks like the language is fairly controversial on there, and it got me thinking. Scala appears to be the preferred language to work with in Spark, and Spark itself is written in Scala, right? I know that often times a successful project evolves gradually out of something small, and that the choice of programming language may not always have been made consciously at the outset. But pretending that it was, why is Scala the preferred language of Spark? Nick -- View this message in context: Why Scala? http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com http://nabble.com/. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Can't seem to link external/twitter classes from my own app
Hmm.. That's not working so well for me. First, I needed to add a project/plugin.sbt file with the contents: addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.4") before 'sbt/sbt assemble' worked at all. And I'm not sure about that version number, but 0.9.1 isn't working much better and 11.4 is the latest one recommended by the sbt project site. Where did you get your version from? Second, even when I do get it to build a .jar, spark-submit is still telling me the external.twitter library is missing. I tried using your GitHub project as-is, but it also complained about the missing plugin.. I'm trying it with various versions now to see if I can get that working, even though I don't know anything about Kafka. Hmm, and no. Here's what I get: [info] Set current project to Simple Project (in build file:/home/ubuntu/spark-1.0.0/SparkKafka/) [error] Not a valid command: assemble [error] Not a valid project ID: assemble [error] Expected ':' (if selecting a configuration) [error] Not a valid key: assemble (similar: assembly, assemblyJarName, assemblyDirectory) [error] assemble [error] I also found this project which seemed to be exactly what I was after: https://github.com/prabeesh/SparkTwitterAnalysis ...but it was for Spark 0.9, and though I updated all the version references to 1.0.0, that one doesn't work either. I can't even get it to build. *sigh* Is it going to be easier to just copy the external/ source code into my own project? Because I will... especially if creating Uberjars takes this long every... single... time... On Thu, Jun 5, 2014 at 8:52 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Thanks Patrick! Uberjars. Cool. I'd actually heard of them. And thanks for the link to the example! I shall work through that today. I'm still learning sbt and its many options... the last new framework I learned was node.js, and I think I've been rather spoiled by npm. At least it's not Maven. Please, oh please don't make me learn Maven too. (The only people who seem to like it have Software Stockholm Syndrome: I know Maven kidnapped me and beat me up, but if you spend long enough with it, you eventually start to sympathize and see its point of view.) On Thu, Jun 5, 2014 at 3:39 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, The issue is that you are using one of the external libraries and these aren't actually packaged with Spark on the cluster, so you need to create an uber jar that includes them. You can look at the example here (I recently did this for a Kafka project and the idea is the same): https://github.com/pwendell/kafka-spark-example You'll want to make an uber jar that includes these packages (run "sbt assembly") and then submit that jar to spark-submit. Also, I'd try running it locally first (if you aren't already) just to make the debugging simpler. - Patrick On Wed, Jun 4, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote: Ah sorry, this may be the thing I learned for the day. The issue is that classes from that particular artifact are missing though. Worth interrogating the resulting .jar file with "jar tf" to see if it made it in? On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath nick.pentre...@gmail.com wrote: @Sean, the %% syntax in SBT should automatically add the Scala major version qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct syntax for the build. 
I seemed to run into this issue with some missing Jackson deps, and solved it by including the jar explicitly on the driver class path: bin/spark-submit --driver-class-path SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar Seems redundant to me since I thought that the JAR as argument is copied to driver and made available. But this solved it for me so perhaps give it a try? On Wed, Jun 4, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: Those aren't the names of the artifacts: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22 The name is spark-streaming-twitter_2.10 On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Man, this has been hard going. Six days, and I finally got a Hello World App working that I wrote myself. Now I'm trying to make a minimal streaming app based on the twitter examples, (running standalone right now while learning) and when running it like this: bin/spark-submit --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar I'm getting this error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ Which I'm guessing is because I haven't put in a dependency to external/twitter in the .sbt, but _how_? I can't find any docs on it. Here's my build file so far: simple.sbt
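For anyone hitting the same wall, a sketch of how that build might be laid out for sbt 0.13 with sbt-assembly 0.11.x; the versions and the "provided" scoping are assumptions for illustration, not necessarily what was used in the thread.

project/plugins.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.4")

simple.sbt:

import AssemblyKeys._

assemblySettings

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

// spark-core and spark-streaming are already on the cluster, so marking them "provided"
// keeps the uber jar small; the twitter pieces are NOT on the cluster and must be bundled
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"

libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"

With that in place the command is "sbt assembly" (not "assemble", hence the errors above), and the resulting jar under target/scala-2.10/ is what gets handed to spark-submit.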
Re: Yay for 1.0.0! EC2 Still has problems.
Thanks for that, Matei! I'll look at that once I get a spare moment. :-) If you like, I'll keep documenting my newbie problems and frustrations... perhaps it might make things easier for others. Another issue I seem to have found (now that I can get small clusters up): some of the examples (the streaming.Twitter ones especially) depend on there being a /mnt/spark and /mnt2/spark directory (I think for Java tempfiles?) and those don't seem to exist out-of-the-box. I have to create those directories and use copy-dir to get them to the workers before those examples run. Much of the last two days for me has been about failing to get any of my own code to work, except for in spark-shell. (which is very nice, btw) At first I tried editing the examples, because I took the documentation literally when it said "Finally, Spark includes several samples in the examples directory (Scala, Java, Python). You can run them as follows:" but of course didn't realize editing them is pointless because while the source is there, the code is actually pulled from a .jar elsewhere. Doh. (so obvious in hindsight) I couldn't even turn down the voluminous INFO messages to WARNs, no matter how many conf/log4j.properties files I edited or copy-dir'd. I'm sure there's a trick to that I'm not getting. Even trying to build SimpleApp I've run into the problem that all the documentation says to use "sbt/sbt assemble", but sbt doesn't seem to be in the 1.0.0 pre-built packages that I downloaded. Ah... yes.. there it is in the source package. I suppose that means that in order to deploy any new code to the cluster, I've got to rebuild from source on my cluster controller. OK, I never liked that Amazon Linux AMI anyway. I'm going to start from scratch again with an Ubuntu 12.04 instance, hopefully that will be more auspicious... Meanwhile I'm learning Scala... Great Turing's Ghost, it's the dream language we've theorized about for years! I hadn't realized! On Mon, Jun 2, 2014 at 12:05 PM, Matei Zaharia matei.zaha...@gmail.com wrote: FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this. Matei On Jun 1, 2014, at 6:14 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Sort of.. there were two separate issues, but both related to AWS.. I've sorted the confusion about the Master/Worker AMI ... use the version chosen by the scripts. (and use the right instance type so the script can choose wisely) But yes, one also needs a launch machine to kick off the cluster, and for that I _also_ was using an Amazon instance... (made sense.. I have a team that will need to do things as well, not just me) and I was just pointing out that if you use the most "recommended by Amazon" AMI (for your free micro instance, for example) you get Python 2.6 and the ec2 scripts fail. That merely needs a line in the documentation saying "use Ubuntu for your cluster controller, not Amazon Linux" or somesuch. But yeah, for a newbie, it was hard working out when to use default or custom AMIs for various parts of the setup. On Mon, Jun 2, 2014 at 4:01 AM, Patrick Wendell pwend...@gmail.com wrote: Hey just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of Python. Spark-ec2 itself has a flag -a that allows you to give a specific AMI. This flag is just an internal tool that we use for testing when we spin new AMIs. 
Users can't set that to an arbitrary AMI because we tightly control things like the Java and OS versions, libraries, etc. On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: *sigh* OK, I figured it out. (Thank you Nick, for the hint) m1.large works, (I swear I tested that earlier and had similar issues... ) It was my obsession with starting r3.*large instances. Clearly I hadn't patched the script in all the places.. which I think caused it to default to the Amazon AMI. I'll have to take a closer look at the code and see if I can't fix it correctly, because I really, really do want nodes with 2x the CPU and 4x the memory for the same low spot price. :-) I've got a cluster up now, at least. Time for the fun stuff... Thanks everyone for the help! On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If you are explicitly specifying the AMI in your invocation of spark-ec2, may I suggest simply removing any explicit mention of AMI from your invocation? spark-ec2 automatically selects an appropriate AMI based on the specified instance type. 2014년 6월 1일 일요일, Nicholas Chammasnicholas.cham...@gmail.com님이 작성한 메시지: Could you post how exactly you are invoking spark-ec2? And are you having trouble just with r3 instances, or with any instance type? 2014년 6월 1일 일요일, Jeremy Leeunorthodox.engine...@gmail.com님이 작성한
Re: Spark on EC2
Hmm.. you've gotten further than me. Which AMI's are you using? On Sun, Jun 1, 2014 at 2:21 PM, superback andrew.matrix.c...@gmail.com wrote: Hi, I am trying to run an example on AMAZON EC2 and have successfully set up one cluster with two nodes on EC2. However, when I was testing an example using the following command, * ./run-example org.apache.spark.examples.GroupByTest spark://`hostname`:7077* I got the following warnings and errors. Can anyone help one solve this problem? Thanks very much! 46781 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 61544 [spark-akka.actor.default-dispatcher-3] ERROR org.apache.spark.deploy.client.AppClient$ClientActor - All masters are unresponsive! Giving up. 61544 [spark-akka.actor.default-dispatcher-3] ERROR org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Spark cluster looks dead, giving up. 61546 [spark-akka.actor.default-dispatcher-3] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Remove TaskSet 0.0 from pool 61549 [main] INFO org.apache.spark.scheduler.DAGScheduler - Failed to run count at GroupByTest.scala:50 Exception in thread main org.apache.spark.SparkException: Job aborted: Spark cluster looks down at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Yay for 1.0.0! EC2 Still has problems.
*sigh* OK, I figured it out. (Thank you, Nick, for the hint.) m1.large works. (I swear I tested that earlier and had similar issues...) It was my obsession with starting r3.*large instances. Clearly I hadn't patched the script in all the places.. which I think caused it to default to the Amazon AMI. I'll have to take a closer look at the code and see if I can't fix it correctly, because I really, really do want nodes with 2x the CPU and 4x the memory for the same low spot price. :-) I've got a cluster up now, at least. Time for the fun stuff... Thanks everyone for the help! On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If you are explicitly specifying the AMI in your invocation of spark-ec2, may I suggest simply removing any explicit mention of AMI from your invocation? spark-ec2 automatically selects an appropriate AMI based on the specified instance type. On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote: Could you post how exactly you are invoking spark-ec2? And are you having trouble just with r3 instances, or with any instance type? On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote: It's been another day of spinning up dead clusters... I thought I'd finally worked out what everyone else knew - don't use the default AMI - but I've now run through all of the official quick-start Linux releases and I'm none the wiser: Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit) Provisions servers, connects, installs, but the webserver on the master will not start Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419 Spot instance requests are not supported for this AMI. SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f Not tested - costs 10x more for spot instances, not economically viable. Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3 Provisions servers, but git is not pre-installed, so the cluster setup fails. Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f Provisions servers, but git is not pre-installed, so the cluster setup fails. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Trouble with EC2
Ha yes,,, I just went through this. (a) You have to use the ;'default' spark AMI, ( ami-7a320f3f at the moment ) and not any of the other linux distros. They don't work. (b) Start with m1.large instances.. I tried going for r3.large at first, and had no end of self-caused trouble. m1.large works. (c) It's possible for the script to choose the wrong AMI, especially if one has been messing with it to allow other instance types. (ahem) But it will work in the end.. just start simple. (yeah, I know m1.large doesn't look that large anymore. :-) On Mon, Jun 2, 2014 at 8:11 AM, PJ$ p...@chickenandwaffl.es wrote: Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't gotten any further. No clue what's wrong. I'd really appreciate any guidance y'all could offer. Best, PJ$ On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia matei.zaha...@gmail.com wrote: What instance types did you launch on? Sometimes you also get a bad individual machine from EC2. It might help to remove the node it’s complaining about from the conf/slaves file. Matei On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote: Hey Folks, I'm really having quite a bit of trouble getting spark running on ec2. I'm not using scripts the https://github.com/apache/spark/tree/master/ec2 because I'd like to know how everything works. But I'm going a little crazy. I think that something about the networking configuration must be messed up, but I'm at a loss. Shortly after starting the cluster, I get a lot of this: 14/05/30 18:03:22 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM 14/05/30 18:03:22 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM 14/05/30 18:03:23 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM 14/05/30 18:03:23 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM 14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 14/05/30 18:05:54 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [ akka.remote.EndpointAssociationException: Association failed with [ akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485 ] 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [ akka.remote.EndpointAssociationException: Association failed with [ akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485 ] 14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [ akka.remote.EndpointAssociationException: Association failed with [ akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485 -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Yay for 1.0.0! EC2 Still has problems.
Sort of.. there were two separate issues, but both related to AWS.. I've sorted the confusion about the Master/Worker AMI ... use the version chosen by the scripts. (and use the right instance type so the script can choose wisely) But yes, one also needs a launch machine to kick off the cluster, and for that I _also_ was using an Amazon instance... (made sense.. I have a team that will need to do things as well, not just me) and I was just pointing out that if you use the most "recommended by Amazon" AMI (for your free micro instance, for example) you get Python 2.6 and the ec2 scripts fail. That merely needs a line in the documentation saying "use Ubuntu for your cluster controller, not Amazon Linux" or somesuch. But yeah, for a newbie, it was hard working out when to use default or custom AMIs for various parts of the setup. On Mon, Jun 2, 2014 at 4:01 AM, Patrick Wendell pwend...@gmail.com wrote: Hey just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of Python. Spark-ec2 itself has a flag -a that allows you to give a specific AMI. This flag is just an internal tool that we use for testing when we spin new AMIs. Users can't set that to an arbitrary AMI because we tightly control things like the Java and OS versions, libraries, etc. On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: *sigh* OK, I figured it out. (Thank you, Nick, for the hint.) m1.large works. (I swear I tested that earlier and had similar issues...) It was my obsession with starting r3.*large instances. Clearly I hadn't patched the script in all the places.. which I think caused it to default to the Amazon AMI. I'll have to take a closer look at the code and see if I can't fix it correctly, because I really, really do want nodes with 2x the CPU and 4x the memory for the same low spot price. :-) I've got a cluster up now, at least. Time for the fun stuff... Thanks everyone for the help! On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If you are explicitly specifying the AMI in your invocation of spark-ec2, may I suggest simply removing any explicit mention of AMI from your invocation? spark-ec2 automatically selects an appropriate AMI based on the specified instance type. On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote: Could you post how exactly you are invoking spark-ec2? And are you having trouble just with r3 instances, or with any instance type? On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote: It's been another day of spinning up dead clusters... I thought I'd finally worked out what everyone else knew - don't use the default AMI - but I've now run through all of the official quick-start Linux releases and I'm none the wiser: Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit) Provisions servers, connects, installs, but the webserver on the master will not start Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419 Spot instance requests are not supported for this AMI. SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f Not tested - costs 10x more for spot instances, not economically viable. Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3 Provisions servers, but git is not pre-installed, so the cluster setup fails. Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f Provisions servers, but git is not pre-installed, so the cluster setup fails. 
-- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Yay for 1.0.0! EC2 Still has problems.
Hi there, Patrick. Thanks for the reply... It wouldn't surprise me that AWS Ubuntu has Python 2.7. Ubuntu is cool like that. :-) Alas, the Amazon Linux AMI (2014.03.1) does not, and it's the very first one on the recommended instance list. (Ubuntu is #4, after Amazon, RedHat, SUSE) So, users such as myself who deliberately pick the Most Amazon-ish obvious first choice find they picked the wrong one. But that's trivial compared to the failure of the cluster to come up, apparently due to the master's http configuration. Any help on that would be much appreciated... it's giving me serious grief. On Sat, May 31, 2014 at 1:37 PM, Patrick Wendell pwend...@gmail.com wrote: Hi Jeremy, That's interesting, I don't think anyone has ever reported an issue running these scripts due to Python incompatibility, but they may require Python 2.7+. I regularly run them from the AWS Ubuntu 12.04 AMI... that might be a good place to start. But if there is a straightforward way to make them compatible with 2.6 we should do that. For r3.large, we can add that to the script. It's a newer type. Any interest in contributing this? - Patrick On May 30, 2014 5:08 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Hi there! I'm relatively new to the list, so sorry if this is a repeat: I just wanted to mention there are still problems with the EC2 scripts. Basically, they don't work. First, if you run the scripts on Amazon's own suggested version of linux, they break because amazon installs Python2.6.9, and the scripts use a couple of Python2.7 commands. I have to sudo yum install python27, and then edit the spark-ec2 shell script to use that specific version. Annoying, but minor. (the base python command isn't upgraded to 2.7 on many systems, apparently because it would break yum) The second minor problem is that the script doesn't know about the r3.large servers... also easily fixed by adding to the spark_ec2.py script. Minor, The big problem is that after the EC2 cluster is provisioned, installed, set up, and everything, it fails to start up the webserver on the master. Here's the tail of the log: Starting GANGLIA gmond:[ OK ] Shutting down GANGLIA gmond: [FAILED] Starting GANGLIA gmond:[ OK ] Connection to ec2-54-183-82-48.us-west-1.compute.amazonaws.com closed. Shutting down GANGLIA gmond: [FAILED] Starting GANGLIA gmond:[ OK ] Connection to ec2-54-183-82-24.us-west-1.compute.amazonaws.com closed. Shutting down GANGLIA gmetad: [FAILED] Starting GANGLIA gmetad: [ OK ] Stopping httpd:[FAILED] Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory [FAILED] Basically, the AMI you have chosen does not seem to have a full install of apache, and is missing several modules that are referred to in the httpd.conf file that is installed. The full list of missing modules is: authn_alias_module modules/mod_authn_alias.so authn_default_module modules/mod_authn_default.so authz_default_module modules/mod_authz_default.so ldap_module modules/mod_ldap.so authnz_ldap_module modules/mod_authnz_ldap.so disk_cache_module modules/mod_disk_cache.so Alas, even if these modules are commented out, the server still fails to start. root@ip-172-31-11-193 ~]$ service httpd start Starting httpd: AH00534: httpd: Configuration error: No MPM loaded. That means Spark 1.0.0 clusters on EC2 are Dead-On-Arrival when run according to the instructions. Sorry. 
Any suggestions on how to proceed? I'll keep trying to fix the webserver, but (a) changes to httpd.conf get blown away by resume, and (b) anything I do has to be redone every time I provision another cluster. Ugh. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Yay for 1.0.0! EC2 Still has problems.
It's been another day of spinning up dead clusters... I thought I'd finally worked out what everyone else knew - don't use the default AMI - but I've now run through all of the official quick-start Linux releases and I'm none the wiser:

Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit): Provisions servers, connects, installs, but the webserver on the master will not start.

Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419: Spot instance requests are not supported for this AMI.

SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f: Not tested - costs 10x more for spot instances, not economically viable.

Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3: Provisions servers, but git is not pre-installed, so the cluster setup fails.

Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f: Provisions servers, but git is not pre-installed, so the cluster setup fails.

Have I missed something? What AMIs are people using? I've just gone back through the archives, and I'm seeing a lot of "I can't get EC2 to work" and not a single "My EC2 has post-install issues". The quickstart page says "...can have a Spark cluster up and running in five minutes." But it's been three days for me so far. I'm about to bite the bullet and start building my own AMIs from scratch... if anyone can save me from that, I'd be most grateful. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers