Hi 

I have a simple streaming app. All I want to do is figure out how many lines
I have received in the current mini batch. If numLines was a JavaRDD I could
simply call count(). How do you do something similar in Streaming?



Here is my psudo code



JavaDStream<String> msg = logs.filter(selectINFO);

JavaDStream<Long> numLines  = msg.count()



Long totalCount = numLines ???





Here is what I am really trying to do. I have a python script that generated
a graph of totalCount vs time. Python does not support streaming. As a work
around I have a java program that does the steaming. I want to pass the data
back to the python script. It has been suggested I can use rdd.pipe().



In python I call rdd.pipe(scriptToStartJavaSteam.sh)



All I need to do is for each mini batch figure out how to get the the count
of the current mini batch and write it to standard out. Seems like this
should be simple. 



Maybe Streams do not work the way I think? In a spark core app, I am able to
get values like count in my driver and do what ever I want with the local
value. With streams I know I am getting mini patches because print() display
the first 10 lines of my steam. I assume that some how print is executed in
my driver so somehow  data was sent from the workers back to the driver.



Any comments or suggestions would be greatly appreciated.



Andy



P.s. Should I be asking a different question?










Reply via email to