Hi, I am new to Spark and have started writing some simple test code to figure out how things work. I am very interested in Spark Streaming and Python, but it appears that streaming is not supported in Python yet. The workaround I found by googling is to write the streaming code in either Scala or Java and use RDD pipe() to fetch the data into the Python app.

The problem is that I do not think I am getting parallel execution. In my current experiment I am using a MacBook Pro with 8 cores. I wrote a Java job that processes a small data file on disk, applies a couple of transformations, and writes the data to standard out.
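Just to spell out the pipe() mechanics as I understand them: each element of the RDD is written to the external command's stdin, one element per line, and every line the command writes to stdout comes back as a string in the result. A throw-away example of what I mean, using plain "cat" as a stand-in for the real job (the app name below is made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="pipeMechanicsTest")

    rdd = sc.parallelize(["a", "b", "c"])
    # each element goes to cat's stdin as one line; each line cat writes
    # to stdout comes back as one string element
    print rdd.pipe("cat").collect()   # ['a', 'b', 'c']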
I have a simple python script that calls pipe():

    import sys
    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="pySparkPipeJava")

    data = [1, 2, 3, 4, 5]
    distData = sc.parallelize(data)

    # pipe() returns strings
    output = distData.pipe("../bin/runJavaJob.sh").collect()
    for num in output:
        print "pySpark: value from java job %s" % (num)

I submit the python job as follows:

    $ $SPARK_HOME/bin/spark-submit --master local[1] pySparkPipeJava.py

Everything works as expected as long as I only use a single core. If I use 4 cores I get back 4 copies of the data. My understanding is that pipe() executes the shell script once on every worker (i.e. once per partition), which would explain why I get a copy of the output from each of them. In general I want all the transforms and actions to run in parallel; however, I only want to process a single set of data in my python script.

Here is the code for runJavaJob.sh:

    $SPARK_HOME/bin/spark-submit \
      --class "myJavaSrc" \
      --master local[4] \
      $jarPath

What is really strange is that if I replace runJavaJob.sh with a simple shell script that basically just echoes its input, I can run with 4 cores and only get back one set of data. Any idea what the difference is? (My Java child does not read anything from the python app; the echo script does.)

I tried changing the number of cores in runJavaJob.sh, but that does not seem to change how much data I get back. Being limited to a single core seems like it would be severely limiting.

Here is the code for my "echo" script:

    #!/bin/sh
    #
    # Use this shell script to figure out how spark RDD pipe() works
    #
    #set -x # turns shell debugging on
    #set +x # turns shell debugging off

    while read x ; do
        echo RDDPipe.sh $x ;
    done

Thanks in advance,

Andy
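P.S. For what it is worth, this is the kind of minimal test I have in mind to confirm the once-per-partition behaviour; "wc -l" is just a stand-in for the real job, and the exact whitespace in its output may vary:

    from pyspark import SparkContext

    sc = SparkContext(appName="pipePartitionTest")

    rdd = sc.parallelize([1, 2, 3, 4, 5], 4)   # force 4 partitions
    # pipe() launches wc -l once per partition, so I get back one
    # line count per partition instead of a single total
    counts = rdd.pipe("wc -l").collect()
    print "number of launches: %d" % len(counts)   # 4
    print counts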