Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-18 Thread Jeremy Lee
Ah, right. So only the launch script has changed. Everything else is still
essentially binary compatible?

Well, that makes it too easy! Thanks!


On Wed, Jun 18, 2014 at 2:35 PM, Patrick Wendell pwend...@gmail.com wrote:

 Actually you'll just want to clone the 1.0 branch then use the
 spark-ec2 script in there to launch your cluster. The --spark-git-repo
 flag is if you want to launch with a different version of Spark on the
 cluster. In your case you just need a different version of the launch
 script itself, which will be present in the 1.0 branch of Spark.

 - Patrick

 On Tue, Jun 17, 2014 at 9:29 PM, Jeremy Lee
 unorthodox.engine...@gmail.com wrote:
  I am about to spin up some new clusters, so I may give that a go... any
  special instructions for making them work? I assume I use the 
  --spark-git-repo= option on the spark-ec2 command. Is it as easy as
  concatenating your string as the value?
 
  On cluster management GUIs... I've been looking around at Ambari,
 Datastax,
  Cloudera, OpsCenter etc. Not totally convinced by any of them yet. Anyone
  using a good one I should know about? I'm really beginning to lean in the
  direction of Cassandra as the distributed data store...
 
 
  On Wed, Jun 18, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  By the way, in case it's not clear, I mean our maintenance branches:
 
  https://github.com/apache/spark/tree/branch-1.0
 
  On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   Hey Jeremy,
  
   This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
   make a 1.0.1 release soon (this patch being one of the main reasons),
   but if you are itching for this sooner, you can just checkout the head
   of branch-1.0 and you will be able to use r3.XXX instances.
  
   - Patrick
  
   On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee
   unorthodox.engine...@gmail.com wrote:
   Some people (me included) might have wondered why all our m1.large
 spot
   instances (in us-west-1) shut down a few hours ago...
  
   Simple reason: The EC2 spot price for Spark's default m1.large
   instances
   just jumped from 0.016 per hour, to about 0.750. Yes, Fifty times.
   Probably
   something to do with world cup.
  
   So far this is just us-west-1, but prices have a tendency to equalize
   across
   centers as the days pass. Time to make backups and plans.
  
   m3 spot prices are still down at $0.02 (and being new, will be
   bypassed by
   older systems), so it would be REAAALLYY nice if there had been some
   progress on that issue. Let me know if I can help with testing and
   whatnot.
  
  
   --
   Jeremy Lee  BCompSci(Hons)
 The Unorthodox Engineers
 
 
 
 
  --
  Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Jeremy Lee
Some people (me included) might have wondered why all our m1.large spot
instances (in us-west-1) shut down a few hours ago...

Simple reason: The EC2 spot price for Spark's default m1.large instances
just jumped from $0.016 per hour to about $0.750. Yes, fifty times. Probably
something to do with the World Cup.

So far this is just us-west-1, but prices have a tendency to equalize
across centers as the days pass. Time to make backups and plans.

m3 spot prices are still down at $0.02 (and being new, will be bypassed
by older systems), so it would be REAAALLYY nice if there had been some
progress on that issue. Let me know if I can help with testing and whatnot.


-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Jeremy Lee
I am about to spin up some new clusters, so I may give that a go... any
special instructions for making them work? I assume I use the
 --spark-git-repo= option on the spark-ec2 command. Is it as easy as
concatenating your string as the value?

On cluster management GUIs... I've been looking around at Ambari,
Datastax, Cloudera, OpsCenter etc. Not totally convinced by any of them
yet. Anyone using a good one I should know about? I'm really beginning to
lean in the direction of Cassandra as the distributed data store...


On Wed, Jun 18, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com wrote:

 By the way, in case it's not clear, I mean our maintenance branches:

 https://github.com/apache/spark/tree/branch-1.0

 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Hey Jeremy,
 
  This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
  make a 1.0.1 release soon (this patch being one of the main reasons),
  but if you are itching for this sooner, you can just checkout the head
  of branch-1.0 and you will be able to use r3.XXX instances.
 
  - Patrick
 
  On Tue, Jun 17, 2014 at 4:17 PM, Jeremy Lee
  unorthodox.engine...@gmail.com wrote:
  Some people (me included) might have wondered why all our m1.large spot
  instances (in us-west-1) shut down a few hours ago...
 
  Simple reason: The EC2 spot price for Spark's default m1.large
 instances
  just jumped from 0.016 per hour, to about 0.750. Yes, Fifty times.
 Probably
  something to do with world cup.
 
  So far this is just us-west-1, but prices have a tendency to equalize
 across
  centers as the days pass. Time to make backups and plans.
 
  m3 spot prices are still down at $0.02 (and being new, will be
 bypassed by
  older systems), so it would be REAAALLYY nice if there had been some
  progress on that issue. Let me know if I can help with testing and
 whatnot.
 
 
  --
  Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Best practise for 'Streaming' dumps?

2014-06-08 Thread Jeremy Lee
I read it more carefully, and window() might actually work for some other
stuff like logs. (assuming I can have multiple windows with entirely
different attributes on a single stream..)

Thanks for that!
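
For the archives, a rough, untested sketch of the updateStateByKey pattern
mentioned in the quoted message below, assuming a DStream of (userId, text)
pairs called "pairs" and a placeholder HDFS checkpoint path (the checkpoint
is what lets the state survive a restart):

import org.apache.spark.streaming.StreamingContext._   // pair DStream operations in 1.0

ssc.checkpoint("hdfs:///checkpoints/myapp")   // placeholder path; required for stateful operations

// running count of tweets per user, carried forward from batch to batch
val counts = pairs.updateStateByKey[Long] { (newTexts: Seq[String], state: Option[Long]) =>
  Some(state.getOrElse(0L) + newTexts.size)
}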


On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee unorthodox.engine...@gmail.com
wrote:

 Yes.. but from what I understand that's a sliding window so for a window
 of (60) over (1) second DStreams, that would save the entire last minute of
 data once per second. That's more than I need.

 I think what I'm after is probably updateStateByKey... I want to mutate
 data structures (probably even graphs) as the stream comes in, but I also
 want that state to be persistent across restarts of the application, (Or
 parallel version of the app, if possible) So I'd have to save that
 structure occasionally and reload it as the primer on the next run.

 I was almost going to use HBase or Hive, but they seem to have been
 deprecated in 1.0.0? Or just late to the party?

 Also, I've been having trouble deleting hadoop directories.. the old two
 line examples don't seem to work anymore. I actually managed to fill up
 the worker instances (I gave them tiny EBS) and I think I crashed them.



 On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo lbust...@gmail.com wrote:

 Have you thought of using window?

 Gino B.

  On Jun 6, 2014, at 11:49 PM, Jeremy Lee unorthodox.engine...@gmail.com
 wrote:
 
 
  It's going well enough that this is a how should I in 1.0.0 rather
 than how do i question.
 
  So I've got data coming in via Streaming (twitters) and I want to
 archive/log it all. It seems a bit wasteful to generate a new HDFS file for
 each DStream, but also I want to guard against data loss from crashes,
 
  I suppose what I want is to let things build up into superbatches
 over a few minutes, and then serialize those to parquet files, or similar?
 Or do i?
 
  Do I count-down the number of DStreams, or does Spark have a preferred
 way of scheduling cron events?
 
  What's the best practise for keeping persistent data for a streaming
 app? (Across restarts) And to clean up on termination?
 
 
  --
  Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers




 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Are scala.MatchError messages a problem?

2014-06-08 Thread Jeremy Lee
I shut down my first (working) cluster and brought up a fresh one... and
It's been a bit of a horror and I need to sleep now. Should I be worried
about these errors? Or did I just have the old log4j.config tuned so I
didn't see them?


14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming
job 1402245172000 ms.2
scala.MatchError: 0101-01-10 (of class java.lang.String)
at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


The error comes from this code, which seemed like a sensible way to match
things (the case cmd_plus(w) statement is the one generating the error):

val cmd_plus = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r
// find command user tweets
val commands = stream.map(
  status => (status.getUser().getId(), status.getText())
).foreachRDD(rdd => {
  rdd.join(superusers).map(
    x => x._2._1
  ).collect().foreach { cmd =>
    cmd match {                      // this is line 218 of SimpleApp.scala
      case cmd_plus(w) => {
        ...
      }
      case cmd_minus(w) => { ... }
    }
  }
})

It seems a bit excessive for Scala to throw exceptions because a regex
didn't match. Something feels wrong.


Re: Are scala.MatchError messages a problem?

2014-06-08 Thread Jeremy Lee
On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

 When you use match, the match must be exhaustive. That is, a match error
 is thrown if the match fails.


Ahh, right. That makes sense. Scala is applying its strong typing rules
here instead of no ceremony... but isn't the idea that type errors should
get picked up at compile time? I suppose the compiler can't tell there's
not complete coverage, but it seems strange to throw that at runtime when
it is literally the 'default case'.
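
(For the archives: one straightforward fix is a catch-all arm, which makes the
match exhaustive so unmatched strings are skipped instead of killing the job.
A rough, self-contained sketch reusing the regex patterns from the earlier
snippet, with placeholder actions:)

val cmd_plus  = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r

def handle(cmd: String): Unit = cmd match {
  case cmd_plus(w)  => println(s"add $w")      // placeholder action
  case cmd_minus(w) => println(s"remove $w")   // placeholder action
  case _            => ()                      // not a command: ignore it instead of throwing MatchError
}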

I think I need a good Scala Programming Guide... any suggestions? I've
read and watched the usual resources and videos, but it feels like a shotgun
approach and I've clearly missed a lot.

On Mon, Jun 9, 2014 at 3:26 AM, Mark Hamstra m...@clearstorydata.com
wrote:

 And you probably want to push down that filter into the cluster --
 collecting all of the elements of an RDD only to not use or filter out some
 of them isn't an efficient usage of expensive (at least in terms of
 time/performance) network resources.  There may also be a good opportunity
 to use the partial function form of collect to push even more processing
 into the cluster.
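
(A rough sketch of that last suggestion, reusing the cmd_plus/cmd_minus
patterns and the rdd/superusers names from the snippet earlier in the thread;
the partial function runs on the executors, so only the matched command words
come back to the driver:)

val commandWords = rdd.join(superusers)
  .map(_._2._1)                            // just the tweet text
  .collect {                               // RDD.collect(PartialFunction): filter + map on the cluster
    case cmd_plus(w)  => ("+", w)
    case cmd_minus(w) => ("-", w)
  }
  .collect()                               // the no-argument collect() brings the (small) result to the driver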


I almost certainly do :-) And I am really looking forward to spending time
optimizing the code, but I keep getting caught up on deployment issues,
uberjars, missing /mnt/spark directories, only being able to submit from
the master, and being thoroughly confused about sample code from three
versions ago.

I'm even thinking of learning maven, if it means I never have to use sbt
again. Does it mean that?

-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: New user streaming question

2014-06-06 Thread Jeremy Lee
Yup, when it's running, DStream.print() will print out a timestamped block
for every time step, even if the block is empty. (for v1.0.0, which I have
running in the other window)

If you're not getting that, I'd guess the stream hasn't started up
properly.
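
For anyone else hitting this: the minimal shape of the netcat word count looks
roughly like the sketch below (Spark 1.0 style; names are placeholders). The
step people most often miss is that nothing runs until ssc.start() is called.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair operations like reduceByKey in 1.0

// local[2]: one thread for the socket receiver, one for processing
val conf = new SparkConf().setAppName("NetcatWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()               // nothing happens until this
ssc.awaitTermination()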


On Sat, Jun 7, 2014 at 11:50 AM, Michael Campbell 
michael.campb...@gmail.com wrote:

 I've been playing with spark and streaming and have a question on stream
 outputs.  The symptom is I don't get any.

 I have run spark-shell and all does as I expect, but when I run the
 word-count example with streaming, it *works* in that things happen and
 there are no errors, but I never get any output.

 Am I understanding how it is supposed to work correctly?  Is the
 Dstream.print() method supposed to print the output for every (micro)batch
 of the streamed data?  If that's the case, I'm not seeing it.

 I'm using the netcat example and the StreamingContext uses the network
 to read words, but as I said, nothing comes out.

 I tried changing the .print() to .saveAsTextFiles(), and I AM getting a
 file, but nothing is in it other than a _temporary subdir.

 I'm sure I'm confused here, but not sure where.  Help?




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Best practise for 'Streaming' dumps?

2014-06-06 Thread Jeremy Lee
It's going well enough that this is a "how should I do this in 1.0.0" rather
than a "how do I" question.

So I've got data coming in via Streaming (twitters) and I want to
archive/log it all. It seems a bit wasteful to generate a new HDFS file for
each DStream, but I also want to guard against data loss from crashes.

I suppose what I want is to let things build up into superbatches over a
few minutes, and then serialize those to Parquet files, or similar? Or do I?
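
One rough way to get that effect (assuming a DStream called "tweets"): a
window whose slide interval equals its length never overlaps, so it behaves
like a tumbling superbatch that gets written out once per window. A sketch:

import org.apache.spark.streaming.Minutes

// non-overlapping 5-minute windows: each record lands in exactly one window
val batched = tweets.window(Minutes(5), Minutes(5))
batched.saveAsTextFiles("hdfs:///archive/tweets")   // one directory of part-files per window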

Do I count-down the number of DStreams, or does Spark have a preferred way
of scheduling cron events?

What's the best practise for keeping persistent data for a streaming app?
(Across restarts) And to clean up on termination?


-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Can't seem to link external/twitter classes from my own app

2014-06-05 Thread Jeremy Lee
I shan't be far. I'm committed now. Spark and I are going to have a very
interesting future together, but hopefully future messages will be about
the algorithms and modules, and less about "how do I run make?".

I suspect doing this at the exact moment of the 0.9-to-1.0.0 transition
hasn't helped me. (I literally had the documentation changing on me between
page reloads last Thursday, after days of studying the old version. I
thought I was going crazy until the new version number appeared in the
corner and the release email went out.)

The last time I entered into a serious relationship with a piece of
software like this was with a little company called Cognos. :-) And then
Microsoft asked us for some advice about a thing called OLAP Server they
were making. (But I don't think they listened as hard as they should have.)

Oh, the things I'm going to do with Spark! If it hadn't existed, I would
have had to make it.

(My honors thesis was in distributed computing. I once created an
incrementally compiled language that could pause execution, decompile, move
to another machine, recompile, restore state and continue while preserving
all active network connections. discuss.)




On Thu, Jun 5, 2014 at 5:46 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Great - well we do hope we hear from you, since the user list is for
 interesting success stories and anecdotes, as well as blog posts etc too :)


 On Thu, Jun 5, 2014 at 9:40 AM, Jeremy Lee unorthodox.engine...@gmail.com
  wrote:

 Oh. Yes of course. *facepalm*

 I'm sure I typed that at first, but at some point my fingers decided to
 grammar-check me. Stupid fingers. I wonder what "sbt assemble" does? (apart
 from error) It certainly takes a while to do it.

 Thanks for the maven offer, but I'm not scheduled to learn that until
 after Scala, streaming, graphx, mllib, HDFS, sbt, Python, and yarn. I'll
 probably need to know it for yarn, but I'm really hoping to put it off
 until then. (fortunately I already knew about linux, AWS, eclipse, git,
 java, distributed programming and ssh keyfiles, or I would have been in
 real trouble)

 Ha! OK, that worked for the Kafka project... fails on the other old 0.9
 Twitter project, but who cares... now for mine

 HAHA! YES!! Oh thank you! I have the equivalent of "hello world" that
 uses one external library! Now the compiler and I can have a _proper_
 conversation.

 Hopefully you won't be hearing from me for a while.



 On Thu, Jun 5, 2014 at 3:06 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

 The magic incantation is "sbt assembly" (not "assemble").

 Actually I find Maven with its assembly plugins to be very easy (mvn
 package). I can send a pom.xml for a skeleton project if you need
 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Thu, Jun 5, 2014 at 6:59 AM, Jeremy Lee 
 unorthodox.engine...@gmail.com wrote:

 Hmm.. That's not working so well for me. First, I needed to add a
 project/plugin.sbt file with the contents:

 addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.4")

 Before 'sbt/sbt assemble' worked at all. And I'm not sure about that
 version number, but 0.9.1 isn't working much better and 11.4 is the
 latest one recommended by the sbt project site. Where did you get your
 version from?

 Second, even when I do get it to build a .jar, spark-submit is still
 telling me the external.twitter library is missing.

 I tried using your github project as-is, but it also complained about
 the missing plugin.. I'm trying it with various versions now to see if I
 can get that working, even though I don't know anything about kafka. Hmm,
 and no. Here's what I get:

  [info] Set current project to Simple Project (in build
 file:/home/ubuntu/spark-1.0.0/SparkKafka/)
 [error] Not a valid command: assemble
 [error] Not a valid project ID: assemble
 [error] Expected ':' (if selecting a configuration)
 [error] Not a valid key: assemble (similar: assembly, assemblyJarName,
 assemblyDirectory)
 [error] assemble
 [error]

 I also found this project which seemed to be exactly what I was after:
  https://github.com/prabeesh/SparkTwitterAnalysis

 ...but it was for Spark 0.9, and though I updated all the version
 references to 1.0.0, that one doesn't work either. I can't even get it to
 build.

 *sigh*

 Is it going to be easier to just copy the external/ source code into my
 own project? Because I will... especially if creating Uberjars takes this
 long every... single... time...



 On Thu, Jun 5, 2014 at 8:52 AM, Jeremy Lee 
 unorthodox.engine...@gmail.com wrote:

 Thanks Patrick!

 Uberjars. Cool. I'd actually heard of them. And thanks for the link to
 the example! I shall work through that today.

 I'm still learning sbt and it's many options... the last new framework
 I learned was node.js, and I think I've been rather spoiled by npm.

 At least it's not maven. Please, oh please don't make me learn maven
 too. (The only people who seem to like it have Software Stockholm 
 Syndrome:
 I know maven kidnapped me

Twitter feed options?

2014-06-05 Thread Jeremy Lee
Me again,

Things have been going well, actually. I've got my build chain sorted,
1.0.0 and streaming is working reliably. I managed to turn off the INFO
messages by messing with every log4j properties file on the system. :-)

One thing I would like to try now is some natural language processing on
some selected Twitter streams (i.e. my own), but the streaming example
seems to be 'sipping from the firehose'. I'm combing through the twitter4j
documentation now, but does anyone know a simple way of restricting the
'flood' to just my own timeline?

Otherwise, yes, this is now the fun part!
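
(For anyone searching the archives later: the bundled receiver takes an
optional list of "track" filter strings, which narrows the stream to matching
tweets, though not, as far as I can tell, to a particular user's timeline;
that seems to need a custom receiver built on twitter4j's FilterQuery. A rough
sketch, with placeholder filter strings and an existing StreamingContext "ssc":)

import org.apache.spark.streaming.twitter.TwitterUtils

// keyword/mention filters (Twitter's "track" parameter), not a timeline
val filters = Seq("@my_handle", "#my_tag")
val tweets  = TwitterUtils.createStream(ssc, None, filters)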

-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Twitter feed options?

2014-06-05 Thread Jeremy Lee
Nope, sorry, nevermind!

I looked at the source, and it was pretty obvious that it didn't implement
that yet, so I've ripped the classes out and am mutating them into a new
receiver right now...

... starting to get the hang of this.


On Fri, Jun 6, 2014 at 1:07 PM, Jeremy Lee unorthodox.engine...@gmail.com
wrote:


 Me again,

 Things have been going well, actually. I've got my build chain sorted,
 1.0.0 and streaming is working reliably. I managed to turn off the INFO
 messages by messing with every log4j properties file on the system. :-)

 On thing I would like to try now is some natural language processing on
 some selected twitter streams. (ie: my own.) but the streaming example
 seems to be 'sipping from the firehose'. I'm combing through the twitter4j
 documentation now, but does anyone know a simple way of restricting the
 'flood' to just my own timeline?

 Otherwise, yes, this is now the fun part!

 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-04 Thread Jeremy Lee
On Wed, Jun 4, 2014 at 12:31 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Ah, sorry to hear you had more problems. Some thoughts on them:


There will always be more problems, 'tis the nature of coding. :-) I try
not to bother the list until I've smacked my head against them for a few
hours, so it's only the most confusing stuff I pour out here. I'm
actually progressing pretty well.


 (the streaming.Twitter ones especially) depend on there being a
 /mnt/spark and /mnt2/spark directory (I think for java tempfiles?) and
 those don't seem to exist out-of-the-box.

 I think this is a side-effect of the r3 instances not having those drives
 mounted. Our setup script would normally create these directories. What was
 the error?


Oh, I went back to m1.large while those issues get sorted out. I decided I
had enough problems without messing with that too. (seriously, why does
Amazon do these things? It's like they _try_ to make the instances
incompatible.)

I forget the exact error, but it traced through createTempFile and it was
fairly clear about the directory being missing. Things like
bin/run-example SparkPi worked fine, but I'll bet twitter4j creates temp
files, so bin/run-example streaming.TwitterPopularTags broke.
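
(For anyone hitting the same missing-directory error: spark.local.dir is the
setting that controls where that scratch space goes, so pointing it at a
directory that exists on every node is another way around it. A rough sketch;
the path and app name are just placeholders:)

import org.apache.spark.SparkConf

// scratch space for shuffle output and temp files; must exist and be
// writable on the master and on every worker
val conf = new SparkConf()
  .setAppName("TwitterPopularTags")
  .set("spark.local.dir", "/tmp/spark-scratch")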

What did you change log4j.properties to? It should be changed to say
 "log4j.rootCategory=WARN, console", but maybe another log4j.properties is
 somehow arriving on the classpath. This is definitely a common problem so
 we need to add some explicit docs on it.


I seem to have this sorted out, don't ask me how. Once again I was probably
editing things on the cluster master when I should have been editing the
cluster controller, or vice versa. But, yeah, many of the examples just get
lost in a sea of DAG INFO messages.
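
One blunt workaround that sidesteps the properties-file hunt entirely is to
set the levels programmatically at the top of the driver, using the log4j 1.x
API that Spark 1.0 ships with. A rough sketch:

import org.apache.log4j.{Level, Logger}

// silence the chatty INFO output from Spark and Akka; keep warnings and errors
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)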


 Are you going through http://spark.apache.org/docs/latest/quick-start.html?
 You should be able to do just sbt package. Once you do that you don’t need
 to deploy your application’s JAR to the cluster, just pass it to
 spark-submit and it will automatically be sent over.


Ah, that answers another question I just asked elsewhere... Yup, I re-read
pretty much every documentation page daily. And I'm making my way through
every video.


  Meanwhile I'm learning scala... Great Turing's Ghost, it's the dream
 language we've theorized about for years! I hadn't realized!

 Indeed, glad you’re enjoying it.


Enjoying? Not yet, alas, though I'm sure I'll get there. But I do understand
the implications of a mixed functional-imperative language with closures and
lambdas. That is serious voodoo.

-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Jeremy Lee
Thanks Patrick!

Uberjars. Cool. I'd actually heard of them. And thanks for the link to the
example! I shall work through that today.

I'm still learning sbt and its many options... the last new framework I
learned was node.js, and I think I've been rather spoiled by npm.

At least it's not maven. Please, oh please don't make me learn maven too.
(The only people who seem to like it have Software Stockholm Syndrome: "I
know maven kidnapped me and beat me up, but if you spend long enough with
it, you eventually start to sympathize and see its point of view.")


On Thu, Jun 5, 2014 at 3:39 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Jeremy,

 The issue is that you are using one of the external libraries and
 these aren't actually packaged with Spark on the cluster, so you need
 to create an uber jar that includes them.

 You can look at the example here (I recently did this for a kafka
 project and the idea is the same):

 https://github.com/pwendell/kafka-spark-example

 You'll want to make an uber jar that includes these packages (run sbt
 assembly) and then submit that jar to spark-submit. Also, I'd try
 running it locally first (if you aren't already) just to make the
 debugging simpler.

 - Patrick


 On Wed, Jun 4, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote:
  Ah sorry, this may be the thing I learned for the day. The issue is
  that classes from that particular artifact are missing though. Worth
  interrogating the resulting .jar file with jar tf to see if it made
  it in?
 
  On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:
  @Sean, the %% syntax in SBT should automatically add the Scala major
 version
  qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct
  syntax for the build.
 
  I seemed to run into this issue with some missing Jackson deps, and
 solved
  it by including the jar explicitly on the driver class path:
 
  bin/spark-submit --driver-class-path
  SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar --class
 SimpleApp
  SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
 
  Seems redundant to me since I thought that the JAR as argument is
 copied to
  driver and made available. But this solved it for me so perhaps give it
 a
  try?
 
 
 
  On Wed, Jun 4, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote:
 
  Those aren't the names of the artifacts:
 
 
 
 http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22
 
  The name is spark-streaming-twitter_2.10
 
  On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee
  unorthodox.engine...@gmail.com wrote:
   Man, this has been hard going. Six days, and I finally got a Hello
   World
   App working that I wrote myself.
  
   Now I'm trying to make a minimal streaming app based on the twitter
   examples, (running standalone right now while learning) and when
 running
   it
   like this:
  
   bin/spark-submit --class SimpleApp
   SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
  
   I'm getting this error:
  
   Exception in thread main java.lang.NoClassDefFoundError:
   org/apache/spark/streaming/twitter/TwitterUtils$
  
   Which I'm guessing is because I haven't put in a dependency to
   external/twitter in the .sbt, but _how_? I can't find any docs on
 it.
   Here's my build file so far:
  
    simple.sbt
    --
    name := "Simple Project"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0"

    libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"

    libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
    --
  
   I've tried a few obvious things like adding:
  
    libraryDependencies += "org.apache.spark" %% "spark-external" % "1.0.0"

    libraryDependencies += "org.apache.spark" %% "spark-external-twitter" % "1.0.0"
  
   because, well, that would match the naming scheme implied so far,
 but it
   errors.
  
  
   Also, I just realized I don't completely understand if:
   (a) the spark-submit command _sends_ the .jar to all the workers,
 or
    (b) the spark-submit command sends a _job_ to the workers, which are
   supposed to already have the jar file installed (or in hdfs), or
   (c) the Context is supposed to list the jars to be distributed. (is
 that
   deprecated?)
  
   One part of the documentation says:
  
Once you have an assembled jar you can call the bin/spark-submit
   script as
   shown here while passing your jar.
  
   but another says:
  
   application-jar: Path to a bundled jar including your application
 and
   all
   dependencies. The URL must be globally visible inside of your
 cluster,
   for
   instance, an hdfs:// path or a file:// path that is present on all
   nodes.
  
   I suppose both could

Re: Why Scala?

2014-06-04 Thread Jeremy Lee
 safely ignored.


 On Thu, May 29, 2014 at 1:55 PM, Nick Chammas 
 nicholas.cham...@gmail.com wrote:

 I recently discovered Hacker News and started reading through older
 posts about Scala
 https://hn.algolia.com/?q=scala#!/story/forever/0/scala. It looks
 like the language is fairly controversial on there, and it got me thinking.

 Scala appears to be the preferred language to work with in Spark, and
 Spark itself is written in Scala, right?

 I know that often times a successful project evolves gradually out of
 something small, and that the choice of programming language may not always
 have been made consciously at the outset.

 But pretending that it was, why is Scala the preferred language of
 Spark?

 Nick


 --
 View this message in context: Why Scala?
 http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.









-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Jeremy Lee
Hmm.. That's not working so well for me. First, I needed to add a
project/plugin.sbt file with the contents:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.4")

Before 'sbt/sbt assemble' worked at all. And I'm not sure about that
version number, but 0.9.1 isn't working much better and 11.4 is the
latest one recommended by the sbt project site. Where did you get your
version from?

Second, even when I do get it to build a .jar, spark-submit is still
telling me the external.twitter library is missing.

I tried using your github project as-is, but it also complained about the
missing plugin.. I'm trying it with various versions now to see if I can
get that working, even though I don't know anything about kafka. Hmm, and
no. Here's what I get:

[info] Set current project to Simple Project (in build
file:/home/ubuntu/spark-1.0.0/SparkKafka/)
[error] Not a valid command: assemble
[error] Not a valid project ID: assemble
[error] Expected ':' (if selecting a configuration)
[error] Not a valid key: assemble (similar: assembly, assemblyJarName,
assemblyDirectory)
[error] assemble
[error]

I also found this project which seemed to be exactly what I was after:
https://github.com/prabeesh/SparkTwitterAnalysis

...but it was for Spark 0.9, and though I updated all the version
references to 1.0.0, that one doesn't work either. I can't even get it to
build.

*sigh*

Is it going to be easier to just copy the external/ source code into my own
project? Because I will... especially if creating Uberjars takes this
long every... single... time...



On Thu, Jun 5, 2014 at 8:52 AM, Jeremy Lee unorthodox.engine...@gmail.com
wrote:

 Thanks Patrick!

 Uberjars. Cool. I'd actually heard of them. And thanks for the link to the
 example! I shall work through that today.

 I'm still learning sbt and it's many options... the last new framework I
 learned was node.js, and I think I've been rather spoiled by npm.

 At least it's not maven. Please, oh please don't make me learn maven too.
 (The only people who seem to like it have Software Stockholm Syndrome: I
 know maven kidnapped me and beat me up, but if you spend long enough with
 it, you eventually start to sympathize and see it's point of view.)


 On Thu, Jun 5, 2014 at 3:39 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Jeremy,

 The issue is that you are using one of the external libraries and
 these aren't actually packaged with Spark on the cluster, so you need
 to create an uber jar that includes them.

 You can look at the example here (I recently did this for a kafka
 project and the idea is the same):

 https://github.com/pwendell/kafka-spark-example

 You'll want to make an uber jar that includes these packages (run sbt
 assembly) and then submit that jar to spark-submit. Also, I'd try
 running it locally first (if you aren't already) just to make the
 debugging simpler.

 - Patrick


 On Wed, Jun 4, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote:
  Ah sorry, this may be the thing I learned for the day. The issue is
  that classes from that particular artifact are missing though. Worth
  interrogating the resulting .jar file with jar tf to see if it made
  it in?
 
  On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:
  @Sean, the %% syntax in SBT should automatically add the Scala major
 version
  qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct
  syntax for the build.
 
  I seemed to run into this issue with some missing Jackson deps, and
 solved
  it by including the jar explicitly on the driver class path:
 
  bin/spark-submit --driver-class-path
  SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar --class
 SimpleApp
  SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
 
  Seems redundant to me since I thought that the JAR as argument is
 copied to
  driver and made available. But this solved it for me so perhaps give
 it a
  try?
 
 
 
  On Wed, Jun 4, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote:
 
  Those aren't the names of the artifacts:
 
 
 
 http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22
 
  The name is spark-streaming-twitter_2.10
 
  On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee
  unorthodox.engine...@gmail.com wrote:
   Man, this has been hard going. Six days, and I finally got a Hello
   World
   App working that I wrote myself.
  
   Now I'm trying to make a minimal streaming app based on the twitter
   examples, (running standalone right now while learning) and when
 running
   it
   like this:
  
   bin/spark-submit --class SimpleApp
   SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
  
   I'm getting this error:
  
   Exception in thread main java.lang.NoClassDefFoundError:
   org/apache/spark/streaming/twitter/TwitterUtils$
  
   Which I'm guessing is because I haven't put in a dependency to
   external/twitter in the .sbt, but _how_? I can't find any docs on
 it.
   Here's my build file so far:
  
   simple.sbt

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-03 Thread Jeremy Lee
Thanks for that, Matei! I'll look at that once I get a spare moment. :-)

If you like, I'll keep documenting my newbie problems and frustrations...
perhaps it might make things easier for others.

Another issue I seem to have found (now that I can get small clusters up):
some of the examples (the streaming.Twitter ones especially) depend on
there being a /mnt/spark and /mnt2/spark directory (I think for java
tempfiles?) and those don't seem to exist out-of-the-box. I have to create
those directories and use copy-dir to get them to the workers before
those examples run.

Much of the the last two days for me have been about failing to get any of
my own code to work, except for in spark-shell. (which is very nice, btw)

At first I tried editing the examples, because I took the documentation
literally when it said "Finally, Spark includes several samples in the
examples directory (Scala, Java, Python). You can run them as follows:"
but of course didn't realize editing them is pointless, because while the
source is there, the code is actually pulled from a .jar elsewhere. Doh.
(So obvious in hindsight.)

I couldn't even turn down the voluminous INFO messages to WARNs, no matter
how many conf/log4j.properties files I edited or copy-dir'd. I'm sure
there's a trick to that I'm not getting.

Even trying to build SimpleApp I've run into the problem that all the
documentation says to use sbt/sbt assemble, but sbt doesn't seem to be in
the 1.0.0 pre-built packages that I downloaded.

Ah... yes.. there it is in the source package. I suppose that means that in
order to deploy any new code to the cluster, I've got to rebuild from
source on my cluster controller. OK, I never liked that Amazon Linux AMI
anyway. I'm going to start from scratch again with an Ubuntu 12.04
instance, hopefully that will be more auspicious...

Meanwhile I'm learning scala... Great Turing's Ghost, it's the dream
language we've theorized about for years! I hadn't realized!



On Mon, Jun 2, 2014 at 12:05 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track
 this.

 Matei


 On Jun 1, 2014, at 6:14 PM, Jeremy Lee unorthodox.engine...@gmail.com
 wrote:

 Sort of.. there were two separate issues, but both related to AWS..

 I've sorted the confusion about the Master/Worker AMI ... use the version
 chosen by the scripts. (and use the right instance type so the script can
 choose wisely)

 But yes, one also needs a launch machine to kick off the cluster, and
 for that I _also_ was using an Amazon instance... (made sense.. I have a
 team that will needs to do things as well, not just me) and I was just
 pointing out that if you use the most recommended by Amazon AMI (for your
 free micro instance, for example) you get python 2.6 and the ec2 scripts
 fail.

 That merely needs a line in the documentation saying use Ubuntu for your
 cluster controller, not Amazon Linux or somesuch. But yeah, for a newbie,
 it was hard working out when to use default or custom AMIs for various
 parts of the setup.


 On Mon, Jun 2, 2014 at 4:01 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey just to clarify this - my understanding is that the poster
 (Jeremey) was using a custom AMI to *launch* spark-ec2. I normally
 launch spark-ec2 from my laptop. And he was looking for an AMI that
 had a high enough version of python.

 Spark-ec2 itself has a flag -a that allows you to give a specific
 AMI. This flag is just an internal tool that we use for testing when
 we spin new AMI's. Users can't set that to an arbitrary AMI because we
 tightly control things like the Java and OS versions, libraries, etc.


 On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
 unorthodox.engine...@gmail.com wrote:
  *sigh* OK, I figured it out. (Thank you Nick, for the hint)
 
  m1.large works, (I swear I tested that earlier and had similar
 issues... )
 
  It was my obsession with starting r3.*large instances. Clearly I
 hadn't
  patched the script in all the places.. which I think caused it to
 default to
  the Amazon AMI. I'll have to take a closer look at the code and see if I
  can't fix it correctly, because I really, really do want nodes with 2x
 the
  CPU and 4x the memory for the same low spot price. :-)
 
  I've got a cluster up now, at least. Time for the fun stuff...
 
  Thanks everyone for the help!
 
 
 
  On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
 
  If you are explicitly specifying the AMI in your invocation of
 spark-ec2,
  may I suggest simply removing any explicit mention of AMI from your
  invocation? spark-ec2 automatically selects an appropriate AMI based
 on the
  specified instance type.
 
  On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
  Could you post how exactly you are invoking spark-ec2? And are you
 having
  trouble just with r3 instances, or with any instance type?
 
  On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:

Re: Spark on EC2

2014-06-01 Thread Jeremy Lee
Hmm... you've gotten further than me. Which AMIs are you using?


On Sun, Jun 1, 2014 at 2:21 PM, superback andrew.matrix.c...@gmail.com
wrote:

 Hi,
 I am trying to run an example on AMAZON EC2 and have successfully
 set up one cluster with two nodes on EC2. However, when I was testing an
 example using the following command,

 *
 ./run-example org.apache.spark.examples.GroupByTest
 spark://`hostname`:7077*

 I got the following warnings and errors. Can anyone help one solve this
 problem? Thanks very much!

 46781 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial
 job has not accepted any resources; check your cluster UI to ensure that
 workers are registered and have sufficient memory
 61544 [spark-akka.actor.default-dispatcher-3] ERROR
 org.apache.spark.deploy.client.AppClient$ClientActor - All masters are
 unresponsive! Giving up.
 61544 [spark-akka.actor.default-dispatcher-3] ERROR
 org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Spark
 cluster looks dead, giving up.
 61546 [spark-akka.actor.default-dispatcher-3] INFO
 org.apache.spark.scheduler.TaskSchedulerImpl - Remove TaskSet 0.0 from pool
 61549 [main] INFO org.apache.spark.scheduler.DAGScheduler - Failed to run
 count at GroupByTest.scala:50
 Exception in thread main org.apache.spark.SparkException: Job aborted:
 Spark cluster looks down
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)







 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Jeremy Lee
*sigh* OK, I figured it out. (Thank you Nick, for the hint)

m1.large works (I swear I tested that earlier and had similar issues...)

It was my obsession with starting r3.*large instances. Clearly I hadn't
patched the script in all the places.. which I think caused it to default
to the Amazon AMI. I'll have to take a closer look at the code and see if I
can't fix it correctly, because I really, really do want nodes with 2x the
CPU and 4x the memory for the same low spot price. :-)

I've got a cluster up now, at least. Time for the fun stuff...

Thanks everyone for the help!



On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 If you are explicitly specifying the AMI in your invocation of spark-ec2,
 may I suggest simply removing any explicit mention of AMI from your
 invocation? spark-ec2 automatically selects an appropriate AMI based on
 the specified instance type.

 On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:

 Could you post how exactly you are invoking spark-ec2? And are you having
 trouble just with r3 instances, or with any instance type?

 On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:

 It's been another day of spinning up dead clusters...

 I thought I'd finally worked out what everyone else knew - don't use the
 default AMI - but I've now run through all of the official quick-start
 linux releases and I'm none the wiser:

 Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
 Provisions servers, connects, installs, but the webserver on the master
 will not start

 Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
 Spot instance requests are not supported for this AMI.

 SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
 Not tested - costs 10x more for spot instances, not economically viable.

 Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.

 Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Trouble with EC2

2014-06-01 Thread Jeremy Lee
Ha, yes... I just went through this.

(a) You have to use the 'default' Spark AMI (ami-7a320f3f at the moment)
and not any of the other Linux distros. They don't work.
(b) Start with m1.large instances.. I tried going for r3.large at first,
and had no end of self-caused trouble. m1.large works.
(c) It's possible for the script to choose the wrong AMI, especially if one
has been messing with it to allow other instance types. (ahem)

But it will work in the end.. just start simple. (yeah, I know m1.large
doesn't look that large anymore. :-)


On Mon, Jun 2, 2014 at 8:11 AM, PJ$ p...@chickenandwaffl.es wrote:

 Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't
 gotten any further. No clue what's wrong. I'd really appreciate any
 guidance y'all could offer.

 Best,
 PJ$


 On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 What instance types did you launch on?

 Sometimes you also get a bad individual machine from EC2. It might help
 to remove the node it’s complaining about from the conf/slaves file.

 Matei

 On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote:

 Hey Folks,

 I'm really having quite a bit of trouble getting spark running on ec2.
 I'm not using the scripts at https://github.com/apache/spark/tree/master/ec2
 because I'd like to know how everything works. But I'm going a little
 crazy. I think that something about the networking configuration must be
 messed up, but I'm at a loss. Shortly after starting the cluster, I get a
 lot of this:

 14/05/30 18:03:22 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:22 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 INFO actor.LocalActorRef: Message
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
 Actor[akka://sparkMaster/deadLetters] to
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246]
 was not delivered. [5] dead letters encountered. This logging can be turned
 off or adjusted with configuration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
 - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
 [Association failed with
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with [
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
 - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
 [Association failed with
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with [
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
 - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
 [Association failed with
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with [
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485






-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Jeremy Lee
Sort of.. there were two separate issues, but both related to AWS..

I've sorted the confusion about the Master/Worker AMI ... use the version
chosen by the scripts. (and use the right instance type so the script can
choose wisely)

But yes, one also needs a launch machine to kick off the cluster, and for
that I _also_ was using an Amazon instance... (made sense... I have a team
that will need to do things as well, not just me) and I was just pointing
out that if you use the AMI most recommended by Amazon (for your free
micro instance, for example) you get Python 2.6 and the ec2 scripts fail.

That merely needs a line in the documentation saying "use Ubuntu for your
cluster controller, not Amazon Linux" or somesuch. But yeah, for a newbie,
it was hard working out when to use default or custom AMIs for various
parts of the setup.


On Mon, Jun 2, 2014 at 4:01 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey just to clarify this - my understanding is that the poster
 (Jeremey) was using a custom AMI to *launch* spark-ec2. I normally
 launch spark-ec2 from my laptop. And he was looking for an AMI that
 had a high enough version of python.

 Spark-ec2 itself has a flag -a that allows you to give a specific
 AMI. This flag is just an internal tool that we use for testing when
 we spin new AMI's. Users can't set that to an arbitrary AMI because we
 tightly control things like the Java and OS versions, libraries, etc.


 On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
 unorthodox.engine...@gmail.com wrote:
  *sigh* OK, I figured it out. (Thank you Nick, for the hint)
 
  m1.large works, (I swear I tested that earlier and had similar
 issues... )
 
  It was my obsession with starting r3.*large instances. Clearly I hadn't
  patched the script in all the places.. which I think caused it to
 default to
  the Amazon AMI. I'll have to take a closer look at the code and see if I
  can't fix it correctly, because I really, really do want nodes with 2x
 the
  CPU and 4x the memory for the same low spot price. :-)
 
  I've got a cluster up now, at least. Time for the fun stuff...
 
  Thanks everyone for the help!
 
 
 
  On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
 
  If you are explicitly specifying the AMI in your invocation of
 spark-ec2,
  may I suggest simply removing any explicit mention of AMI from your
  invocation? spark-ec2 automatically selects an appropriate AMI based on
 the
  specified instance type.
 
  On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
  Could you post how exactly you are invoking spark-ec2? And are you
 having
  trouble just with r3 instances, or with any instance type?
 
  On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:
 
  It's been another day of spinning up dead clusters...
 
  I thought I'd finally worked out what everyone else knew - don't use
 the
  default AMI - but I've now run through all of the official
 quick-start
  linux releases and I'm none the wiser:
 
  Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
  Provisions servers, connects, installs, but the webserver on the master
  will not start
 
  Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
  Spot instance requests are not supported for this AMI.
 
  SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
  Not tested - costs 10x more for spot instances, not economically
 viable.
 
  Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
  Provisions servers, but git is not pre-installed, so the cluster
 setup
  fails.
 
  Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
  Provisions servers, but git is not pre-installed, so the cluster
 setup
  fails.
 
 
 
 
  --
  Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-31 Thread Jeremy Lee
Hi there, Patrick. Thanks for the reply...

It wouldn't surprise me that AWS Ubuntu has Python 2.7. Ubuntu is cool like
that. :-)

Alas, the Amazon Linux AMI (2014.03.1) does not, and it's the very first
one on the recommended instance list. (Ubuntu is #4, after Amazon, RedHat,
and SUSE.) So, users such as myself who deliberately pick the most
Amazon-ish, obvious first choice find they picked the wrong one.

But that's trivial compared to the failure of the cluster to come up,
apparently due to the master's http configuration. Any help on that would
be much appreciated... it's giving me serious grief.



On Sat, May 31, 2014 at 1:37 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hi Jeremy,

 That's interesting, I don't think anyone has ever reported an issue
 running these scripts due to Python incompatibility, but they may require
 Python 2.7+. I regularly run them from the AWS Ubuntu 12.04 AMI... that
 might be a good place to start. But if there is a straightforward way to
 make them compatible with 2.6 we should do that.

 For r3.large, we can add that to the script. It's a newer type. Any
 interest in contributing this?

 - Patrick

 On May 30, 2014 5:08 AM, Jeremy Lee unorthodox.engine...@gmail.com
 wrote:


 Hi there! I'm relatively new to the list, so sorry if this is a repeat:

 I just wanted to mention there are still problems with the EC2 scripts.
 Basically, they don't work.

 First, if you run the scripts on Amazon's own suggested version of linux,
 they break because amazon installs Python2.6.9, and the scripts use a
 couple of Python2.7 commands. I have to sudo yum install python27, and
 then edit the spark-ec2 shell script to use that specific version.
 Annoying, but minor.

 (the base python command isn't upgraded to 2.7 on many systems,
 apparently because it would break yum)

 The second minor problem is that the script doesn't know about the
 r3.large servers... also easily fixed by adding to the spark_ec2.py
 script. Minor,

 The big problem is that after the EC2 cluster is provisioned, installed,
 set up, and everything, it fails to start up the webserver on the master.
 Here's the tail of the log:

 Starting GANGLIA gmond:[  OK  ]
 Shutting down GANGLIA gmond:   [FAILED]
 Starting GANGLIA gmond:[  OK  ]
 Connection to ec2-54-183-82-48.us-west-1.compute.amazonaws.com closed.
 Shutting down GANGLIA gmond:   [FAILED]
 Starting GANGLIA gmond:[  OK  ]
 Connection to ec2-54-183-82-24.us-west-1.compute.amazonaws.com closed.
 Shutting down GANGLIA gmetad:  [FAILED]
 Starting GANGLIA gmetad:   [  OK  ]
 Stopping httpd:[FAILED]
 Starting httpd: httpd: Syntax error on line 153 of
 /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into
 server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object
 file: No such file or directory
[FAILED]

 Basically, the AMI you have chosen does not seem to have a full install
 of apache, and is missing several modules that are referred to in the
 httpd.conf file that is installed. The full list of missing modules is:

 authn_alias_module modules/mod_authn_alias.so
 authn_default_module modules/mod_authn_default.so
 authz_default_module modules/mod_authz_default.so
 ldap_module modules/mod_ldap.so
 authnz_ldap_module modules/mod_authnz_ldap.so
 disk_cache_module modules/mod_disk_cache.so

 Alas, even if these modules are commented out, the server still fails to
 start.

 root@ip-172-31-11-193 ~]$ service httpd start
 Starting httpd: AH00534: httpd: Configuration error: No MPM loaded.

 That means Spark 1.0.0 clusters on EC2 are Dead-On-Arrival when run
 according to the instructions. Sorry.

 Any suggestions on how to proceed? I'll keep trying to fix the webserver,
 but (a) changes to httpd.conf get blown away by resume, and (b) anything
 I do has to be redone every time I provision another cluster. Ugh.

 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers








-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-31 Thread Jeremy Lee
It's been another day of spinning up dead clusters...

I thought I'd finally worked out what everyone else knew - don't use the
default AMI - but I've now run through all of the official quick-start
linux releases and I'm none the wiser:

Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
Provisions servers, connects, installs, but the webserver on the master
will not start

Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
Spot instance requests are not supported for this AMI.

SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
Not tested - costs 10x more for spot instances, not economically viable.

Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
Provisions servers, but git is not pre-installed, so the cluster setup
fails.

Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
Provisions servers, but git is not pre-installed, so the cluster setup
fails.

Have I missed something? What AMIs are people using? I've just gone back
through the archives, and I'm seeing a lot of "I can't get EC2 to work" and
not a single "My EC2 has post-install issues".

The quickstart page says "...can have a spark cluster up and running in
five minutes." But it's been three days for me so far. I'm about to bite
the bullet and start building my own AMIs from scratch... if anyone can
save me from that, I'd be most grateful.

-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers