Hi Tathagata,
I am very new to Spark streaming and I have never used the pipe() function
yet.
I have written a Spark streaming program (JAVA API) which is receiving data
from Kafka and simply printing now.
*JavaStreamingContext ssc = new JavaStreamingContext(args[0],
"SparkStreamExample", new Duration(1000),
System.getenv("SPARK_HOME"),
JavaStreamingContext.jarOfClass(SparkStreamExample.class));
JavaPairDStream<String, String> messages =
KafkaUtils.createStream(ssc, args[1], args[2],
topicMap);
messages.print();
ssc.start();
ssc.awaitTermination();*
There is another simple Spark program (Python API) which does some data
cleaning and saves it to HDFS.
*if __name__ == "__main__":
if len(sys.argv) < 2:
print >> sys.stderr, "Usage: <master> <file>"
exit(-1)
# Instead of reading from HDFS, I want this program to read from the
Java-Spark streaming process
sc = SparkContext(sys.argv[1], "TextCleanUp")
lines = sc.textFile(sys.argv[2])
cleanText = lines.map(cleanFunction).filter(lambda x: len(x) > 0)
cleanText.saveAsTextFile("hdfs://<IP>/user/root/cleanout")*
What I want is that the Python Spark program should read the data from the
std output of the Java Spark streaming program. I have somehow understood
that I need to use pipe() for this. But I am unable to understand how to use
it.
Can you please provide me with an example of how to use Spark's *pipe()*
function for the above context?
Thanks in advance.
Regards,
Gaurav
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Using-PySpark-for-Streaming-tp1882p6391.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.