If you have been able to run Spark Pi on YARN, then you should be able to run the streaming example HdfsWordCount <https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/HdfsWordCount.scala> as well. Even though the instructions in the example say to run it on a local machine, you can run it on YARN in the same way as Spark Pi. You just have to give the appropriate Spark master URL and use an HDFS directory as the 2nd parameter. Then any text file written to that HDFS directory will get "word counted".
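For reference, the core of such a job looks roughly like the sketch below. This is not the linked example verbatim: the object name HdfsWordCountSketch, the 2-second batch interval, and the use of SparkConf (which needs a Spark release that ships it) are my assumptions. The two arguments are the master URL and the HDFS directory, as described above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Brings the implicit pair-DStream operations (reduceByKey) into scope on older Spark releases.
import org.apache.spark.streaming.StreamingContext._

// Hypothetical object name; a sketch of the idea, not the linked example itself.
object HdfsWordCountSketch {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: HdfsWordCountSketch <master> <hdfs-directory>")
      System.exit(1)
    }
    // 1st parameter: Spark master URL, 2nd parameter: HDFS directory to watch.
    val Array(master, directory) = args.take(2)

    val conf = new SparkConf().setMaster(master).setAppName("HdfsWordCountSketch")
    // Assumed batch interval of 2 seconds.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Each new text file that appears in the directory becomes part of the stream.
    val lines = ssc.textFileStream(directory)
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Print the first few counts of every batch to stdout.
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The counts are printed per batch, so each batch only shows words from files that arrived during that interval.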
Note that you should write a file to that HDFS directory by moving it there from some other directory, so that it shows up complete rather than half-written. For example, if the HDFS directory that you want to use to run the example is *hdfs://myhdfs:9000/mydir/*, then you can first copy a local file (say new_file) to "*hdfs://myhdfs:9000/temp_location/new_file*" and then move it to "*hdfs://myhdfs:9000/mydir/new_file*".

On Thu, Jan 9, 2014 at 5:29 PM, Mike Percy <[email protected]> wrote:

> After looking through the docs, grepping the commit logs and looking on
> the list archives, I have been unable to see an indication or example of
> Spark streaming working on YARN. Is this possible yet? So far, I've gotten
> at least the Spark Pi example to run on YARN with CDH5 beta 1.
>
> I am about to dig into the code and try to figure out how the batch Yarn
> client works, to see how much work it would be to set up an AM to run an
> InputDStream, but thought I'd make it easy on myself ask here first before
> I got started.
>
> Thanks in advance for any pointers,
> Mike
