It's much easier than all this. Spark Streaming gives you a DStream of RDDs, and you want the count for each RDD. DStream.count() gives you exactly that: a DStream of Longs, one count of events per mini batch. To get those values back in the driver (for example, to write them to standard out), call foreachRDD on the resulting DStream and collect the single Long from each batch.
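Spark aside, the per-batch logic is easy to see in plain Java. This is a minimal Spark-free sketch of what logs.filter(selectINFO).count() computes for each mini batch; the batch contents and the "INFO" prefix check are made-up stand-ins for illustration:

```java
import java.util.Arrays;
import java.util.List;

public class BatchCountSketch {

    // Counts the INFO lines in one batch -- the same shape as
    // logs.filter(selectINFO).count() applied to a single mini batch
    static long countInfo(List<String> batch) {
        return batch.stream()
                    .filter(line -> line.startsWith("INFO"))
                    .count();
    }

    public static void main(String[] args) {
        // Hypothetical mini batches of log lines (stand-ins for the RDDs in a DStream)
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("INFO start", "DEBUG x", "INFO done"),
                Arrays.asList("WARN slow", "INFO tick"));

        // One count per mini batch, printed to standard out
        for (List<String> batch : batches) {
            System.out.println(countInfo(batch));
        }
    }
}
```

In the real streaming job the filtering and counting run on the workers; only the one Long per batch comes back to the driver, which is why printing it there is cheap.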
On Tue, Sep 30, 2014 at 8:42 PM, Andy Davidson <a...@santacruzintegration.com> wrote:
> Hi
>
> I have a simple streaming app. All I want to do is figure out how many
> lines I have received in the current mini batch. If numLines were a
> JavaRDD I could simply call count(). How do you do something similar in
> Streaming?
>
> Here is my pseudo code:
>
> JavaDStream<String> msg = logs.filter(selectINFO);
> JavaDStream<Long> numLines = msg.count();
>
> Long totalCount = numLines ???
>
> Here is what I am really trying to do. I have a Python script that
> generates a graph of totalCount vs. time. Python does not support
> streaming. As a workaround I have a Java program that does the
> streaming, and I want to pass the data back to the Python script. It
> has been suggested I can use rdd.pipe().
>
> In Python I call rdd.pipe(scriptToStartJavaStream.sh).
>
> All I need to do for each mini batch is figure out how to get the
> count of the current mini batch and write it to standard out. It seems
> like this should be simple.
>
> Maybe streams do not work the way I think? In a Spark core app, I am
> able to get values like count in my driver and do whatever I want with
> the local value. With streams I know I am getting mini batches because
> print() displays the first 10 lines of my stream. I assume that print
> is somehow executed in my driver, so somehow data was sent from the
> workers back to the driver.
>
> Any comments or suggestions would be greatly appreciated.
>
> Andy
>
> P.S. Should I be asking a different question?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org