I have a Python script that submits Spark jobs using the spark-submit tool. I
want to execute the command and write its output both to STDOUT and to a
logfile in real time. I'm using Python 2.7 on an Ubuntu server.
This is what I have so far in my SubmitJob.py script:
import subprocess
import sys

# Submit the command and tee its output to stdout and a log file
def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            sys.stdout.write(output)
            fh.write(output)
        rc = process.poll()
        return rc
if __name__ == "__main__":
    cmdList = ["dse", "spark-submit", "--spark-master",
               "spark://127.0.0.1:7077", "--class", "com.spark.myapp",
               "./myapp.jar"]
    log_file = "/tmp/out.log"
    exit_status = submitJob(cmdList, log_file)
    print "job finished with status ", exit_status
The strange thing is that when I execute the same command directly in the
shell, it works fine and produces output on screen as the program proceeds.
So it looks like something is wrong with the way I'm using subprocess.PIPE
for stdout and writing to the file.
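For reference, here is a minimal, self-contained version of the tee pattern I'm trying to use, with a plain echo command standing in for spark-submit (the helper name, the echo command, and the log path are just for illustration):

```python
import subprocess
import sys

def tee_run(cmd, log_path):
    """Run cmd, copying each stdout line to the terminal and to log_path."""
    with open(log_path, 'w') as fh:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
        # readline returns '' only at EOF in text mode
        for line in iter(proc.stdout.readline, ''):
            sys.stdout.write(line)  # echo to the terminal immediately
            fh.write(line)          # and copy to the log file
            fh.flush()              # don't let the file lag behind
        proc.stdout.close()
        return proc.wait()

# plain shell command standing in for spark-submit
rc = tee_run(["echo", "hello from the child"], "/tmp/tee_demo.log")
```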
What is the currently recommended way to use the subprocess module to write
to stdout and a log file in real time, line by line? I see a lot of
different options on the internet but am not sure which is correct or up to
date.
Is there anything specific about the way spark-submit buffers its stdout
that I need to take care of?
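For completeness, here is how I understand Popen's own buffering knob, if I've read the docs right: bufsize only affects the parent's side of the pipe, not how the child (spark-submit here) buffers its own stdout. The echo command below is just a stand-in:

```python
import subprocess

# bufsize=1 requests line buffering on the parent's side (text mode only);
# it does not change how the child process buffers its own stdout.
proc = subprocess.Popen(["echo", "parent-side buffering demo"],
                        stdout=subprocess.PIPE,
                        bufsize=1,
                        universal_newlines=True)
first_line = proc.stdout.readline()
proc.wait()
```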