Hi I have a simple streaming app. All I want to do is figure out how many lines I have received in the current mini batch. If numLines was a JavaRDD I could simply call count(). How do you do something similar in Streaming?
Here is my psudo code JavaDStream<String> msg = logs.filter(selectINFO); JavaDStream<Long> numLines = msg.count() Long totalCount = numLines ??? Here is what I am really trying to do. I have a python script that generated a graph of totalCount vs time. Python does not support streaming. As a work around I have a java program that does the steaming. I want to pass the data back to the python script. It has been suggested I can use rdd.pipe(). In python I call rdd.pipe(scriptToStartJavaSteam.sh) All I need to do is for each mini batch figure out how to get the the count of the current mini batch and write it to standard out. Seems like this should be simple. Maybe Streams do not work the way I think? In a spark core app, I am able to get values like count in my driver and do what ever I want with the local value. With streams I know I am getting mini patches because print() display the first 10 lines of my steam. I assume that some how print is executed in my driver so somehow data was sent from the workers back to the driver. Any comments or suggestions would be greatly appreciated. Andy P.s. Should I be asking a different question?