Hello, I am trying to build a system that does a very simple calculation on a stream and displays the results in a graph, and I want the graph to update every second or so. I think I have a fundamental misunderstanding of how streams and rdd.pipe() work. I want to do the data visualization part using an IPython notebook: it is really easy to graph, animate, and share the page. I understand that Spark Streaming does not work in Python yet. Googling around, it appears you can use RDD.pipe() to get the streaming data into Python.
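In case it helps show where my understanding is, here is my rough mental model of what RDD.pipe(cmd) does for a single partition. This is just a sketch I wrote with subprocess, not Spark's actual implementation, and "tr a-z A-Z" is only a stand-in for whatever external command you pipe through:

```python
import subprocess

def pipe_partition(elements, cmd):
    """Rough model of RDD.pipe(cmd) for one partition: each element is
    written to cmd's stdin, one per line, and each line of cmd's stdout
    becomes an element of the resulting RDD."""
    proc = subprocess.run(
        cmd,
        input="".join(e + "\n" for e in elements),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()

# Example: pipe_partition(["warn", "error"], ["tr", "a-z", "A-Z"])
# returns ["WARN", "ERROR"]
```

So as I understand it, the external command only ever sees the elements of the RDD it is handed on the worker, which is exactly why I am confused about how streaming results reach the driver/notebook.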
I have a little IPython notebook I have been experimenting with. It uses rdd.pipe() to run the following Java job. The pseudocode is:

    public static void main(String[] args) {
        JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(1000));
        JavaDStream<String> logs = createStream(ssc);
        JavaDStream<String> msg = logs.filter(selectMsgLevel);
        JavaDStream<Long> count = msg.count();
        logs.print();
        ssc.start();
        ssc.awaitTermination();
    }

I do not understand how I can pass any data back to Python. If my understanding is correct, everything runs on a worker; there is no way for the driver to get the value of the mini-batch count and write it out to standard out. The Streaming documentation's design patterns for using foreachRDD demonstrate how the slaves/workers can send data to other systems. So does the overall architecture of my little demo need to be something like:

Process a) IPython notebook: rdd.pipe(myReader.sh)
Process b) myReader.sh: basically some little daemon process that the workers can connect to. It just writes whatever it receives to standard out.
Process c) the Java Spark Streaming code

Do I really need the "daemon in the middle"?

Any insights would be greatly appreciated.

Andy

P.S. I assume that if I wanted to use aggregates I would need to use JavaDStream.wrapRDD(). Is this correct? Is it expensive?
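To make process b) concrete, here is a minimal sketch of what I imagine that relay daemon would look like. The name relay_lines is my own, and I am assuming one worker connection at a time: it accepts a connection, reads newline-delimited messages (which the workers would send from a foreachRDD, as in the design-patterns docs), and copies them to standard out, where rdd.pipe() would pick them up:

```python
import socket

def relay_lines(server_sock, out, max_lines):
    """Accept one connection on server_sock and copy up to max_lines
    newline-delimited messages from it to `out` (standard out in the
    real daemon)."""
    conn, _addr = server_sock.accept()
    with conn, conn.makefile("r") as reader:
        for i, line in enumerate(reader):
            if i >= max_lines:
                break
            out.write(line)
            out.flush()
```

In the real myReader.sh you would bind a listening socket to a fixed port and call something like relay_lines(srv, sys.stdout, ...) in a loop, so the pipe sees one count per mini-batch. It feels like a lot of moving parts for such a simple job, hence my question about whether the middleman is really necessary.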