Hi, I'm having problems using pipe() from a Spark program written in Java to call a Python script, running on a YARN cluster. The problem is that the job fails when YARN kills the container because the Python script goes beyond the memory limits. I get something like this in the log:
01_000004. Exit status: 143. Diagnostics: Container [pid=6976,containerID=container_1424279690678_0078_01_000004] is running beyond physical memory limits. Current usage: 7.5 GB of 7.5 GB physical memory used; 8.6 GB of 23.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1424279690678_0078_01_000004 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 6976 1457 6976 6976 (bash) 0 0 108613632 338 /bin/bash -c /usr/java/jdk1.7.0_71/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms7048m -Xmx7048m -Djava.io.tmpdir=/mnt/data1/hadoop/yarn/local/usercache/root/appcache/application_1424279690678_0078/container_1424279690678_0078_01_000004/tmp '-Dspark.driver.port=33589' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@slave3.lambdoop.com:33589/user/CoarseGrainedScheduler 5 slave1.lambdoop.com 1 application_1424279690678_0078 1> /mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004/stdout 2> /mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004/stderr
|- 10513 6982 6976 6976 (python2.7) 9308 1224 448360448 13857 /usr/local/bin/python2.7 /mnt/my_script.py my_args
|- 6982 6976 6976 6976 (java) 115176 12032 8632229888 1951974 /usr/java/jdk1.7.0_71/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms7048m -Xmx7048m -Djava.io.tmpdir=/mnt/data1/hadoop/yarn/local/usercache/root/appcache/application_1424279690678_0078/container_1424279690678_0078_01_000004/tmp -Dspark.driver.port=33589 -Dspark.ui.port=0 -Dspark.yarn.app.container.log.dir=/mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@slave3.lambdoop.com:33589/user/CoarseGrainedScheduler 5 slave1.lambdoop.com 1 application_1424279690678_0078
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

I find this strange because the Python script processes each input line separately and makes a simple, independent calculation per line: it basically parses the line, calculates the Haversine distance, and returns a double value. Input lines are traversed in Python with a "for line in sys.stdin" loop (I've included a simplified sketch of the script at the end of this message).

Also, to avoid memory leaks in Python:
- I call sys.stdout.flush() after each output line generated by Python.
- I call the following function after writing each output line, to force garbage collection regularly in Python:

import gc

_iterations_until_gc = 1000
iterations_since_gc = 0

def update_garbage_collector():
    global iterations_since_gc
    if iterations_since_gc >= _iterations_until_gc:
        gc.collect()
        iterations_since_gc = 0
    else:
        iterations_since_gc += 1

So the memory consumption of the script should be constant, but in practice it looks like there is some memory leak. Maybe Spark introduces a leak when redirecting the IO in pipe()? Has any of you experienced similar situations when using pipe() in Spark?

Also, do you know how I could control the amount of memory reserved for the subprocess created by pipe()? I understand that with --executor-memory I set the memory for the Spark executor process, but not for the subprocess created by pipe().

Thanks in advance for your help.

Greetings,
Juan
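P.S. In case it helps, this is roughly what the Python script does. It is only a simplified sketch, not the exact code: the parsing and the haversine() helper shown here are illustrative (I just assume four comma-separated coordinates per line), and update_garbage_collector() is the same function quoted above.

import sys
import gc
from math import radians, sin, cos, asin, sqrt

_iterations_until_gc = 1000
iterations_since_gc = 0

def update_garbage_collector():
    # Force a gc.collect() every _iterations_until_gc output lines.
    global iterations_since_gc
    if iterations_since_gc >= _iterations_until_gc:
        gc.collect()
        iterations_since_gc = 0
    else:
        iterations_since_gc += 1

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Each input line is processed independently; nothing is accumulated
# across lines, which is why I expect memory usage to stay flat.
for line in sys.stdin:
    lat1, lon1, lat2, lon2 = map(float, line.strip().split(','))
    sys.stdout.write('%f\n' % haversine(lat1, lon1, lat2, lon2))
    sys.stdout.flush()
    update_garbage_collector()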