Re: Using PySpark for Streaming

gaurav.dasgupta Mon, 26 May 2014 02:21:07 -0700

Hi Tathagata,

I am very new to Spark streaming and I have never used the pipe() function
yet.


I have written a Spark streaming program (JAVA API) which is receiving data
from Kafka and simply printing now. 

*JavaStreamingContext ssc = new JavaStreamingContext(args[0], 
                                "SparkStreamExample", new Duration(1000), 
                                System.getenv("SPARK_HOME"), 
                                
JavaStreamingContext.jarOfClass(SparkStreamExample.class));

JavaPairDStream<String, String> messages = 
                                KafkaUtils.createStream(ssc, args[1], args[2], 
topicMap);

messages.print();

ssc.start();
ssc.awaitTermination();*

There is another simple Spark program (Python API) which does some data
cleaning and saves it to HDFS. 

*if __name__ == "__main__":
    if len(sys.argv) < 2:
        print >> sys.stderr, "Usage: <master> <file>"
        exit(-1)

    # Instead of reading from HDFS, I want this program to read from the
Java-Spark streaming process
    sc = SparkContext(sys.argv[1], "TextCleanUp")
    lines = sc.textFile(sys.argv[2])
    cleanText = lines.map(cleanFunction).filter(lambda x: len(x) > 0)
    cleanText.saveAsTextFile("hdfs://<IP>/user/root/cleanout")*

What I want is that the Python Spark program should read the data from the
std output of the Java Spark streaming program. I have somehow understood
that I need to use pipe() for this. But I am unable to understand how to use
it. 

Can you please provide me with an example of how to use Spark's *pipe()*
function for the above context?

Thanks in advance.

Regards,
Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-PySpark-for-Streaming-tp1882p6391.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Using PySpark for Streaming

Reply via email to