It's much easier than all this. Spark Streaming gives you a DStream of
RDDs. You want the count for each RDD. DStream.count() gives you
exactly that: a DStream of Longs which are the counts of events in
each mini batch.
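
To make the per-batch idea concrete, here is a rough sketch in plain Java. It is not Spark code: it only models a stream as a list of mini batches so it runs without a cluster. The class name BatchCounts, the helper countPerBatch and the sample log lines are made up for illustration. In a real app, I'd assume you would call foreachRDD on the DStream returned by count() and print the single value in each RDD from there.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model of per-batch counting; this is NOT Spark code.
class BatchCounts {
    // Hypothetical helper: returns one count per mini batch,
    // which is the shape of result DStream.count() gives you.
    static List<Long> countPerBatch(List<List<String>> batches, Predicate<String> keep) {
        List<Long> counts = new ArrayList<>();
        for (List<String> batch : batches) {
            // Filter the batch, then count what is left.
            counts.add(batch.stream().filter(keep).count());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three made-up mini batches of log lines.
        List<List<String>> batches = Arrays.asList(
            Arrays.asList("INFO a", "WARN b", "INFO c"),
            Arrays.asList("WARN d"),
            Arrays.asList("INFO e"));

        // Write one count per mini batch to standard out.
        for (long n : countPerBatch(batches, line -> line.startsWith("INFO"))) {
            System.out.println(n);
        }
    }
}
```

The point is that you never get a single totalCount out of the stream. You get one Long per mini batch, and you act on each one as it arrives, for example by writing it to standard out.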

On Tue, Sep 30, 2014 at 8:42 PM, Andy Davidson
<a...@santacruzintegration.com> wrote:
> Hi
>
> I have a simple streaming app. All I want to do is figure out how many lines
> I have received in the current mini batch. If numLines was a JavaRDD I could
> simply call count(). How do you do something similar in Streaming?
>
>
> Here is my pseudocode:
>
>
>
> JavaDStream<String> msg = logs.filter(selectINFO);
>
> JavaDStream<Long> numLines = msg.count();
>
>
> Long totalCount = numLines ???
>
>
>
> Here is what I am really trying to do. I have a Python script that generates
> a graph of totalCount vs. time. Python does not support streaming. As a
> workaround I have a Java program that does the streaming. I want to pass the
> data back to the Python script. It has been suggested I can use rdd.pipe().
>
>
> In python I call rdd.pipe(scriptToStartJavaSteam.sh)
>
>
> All I need to do is, for each mini batch, figure out how to get the count
> of the current mini batch and write it to standard out. Seems like this
> should be simple.
>
>
> Maybe streams do not work the way I think? In a Spark core app, I am able to
> get values like count in my driver and do whatever I want with the local
> value. With streams I know I am getting mini batches because print() displays
> the first 10 lines of my stream. I assume that print is somehow executed in
> my driver, so somehow data was sent from the workers back to the driver.
>
>
> Any comments or suggestions would be greatly appreciated.
>
>
> Andy
>
>
> P.S. Should I be asking a different question?
>
>
>
>
>
