Hi, I'm having problems using pipe() from a Spark program written in Java to call a Python script, running on a YARN cluster. The problem is that the job fails when YARN kills the container because the Python script goes beyond the memory limits. I get something like this in the log:
01_000004. Exit status: 143. Diagnostics: Container [pid=6976,containerID=container_1424279690678_0078_01_000004] is running beyond physical memory limits. Current usage: 7.5 GB of 7.5 GB physical memory used; 8.6 GB of 23.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1424279690678_0078_01_000004 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 6976 1457 6976 6976 (bash) 0 0 108613632 338 /bin/bash -c /usr/java/jdk1.7.0_71/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms7048m -Xmx7048m -Djava.io.tmpdir=/mnt/data1/hadoop/yarn/local/usercache/root/appcache/application_1424279690678_0078/container_1424279690678_0078_01_000004/tmp '-Dspark.driver.port=33589' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@slave3.lambdoop.com:33589/user/CoarseGrainedScheduler 5 slave1.lambdoop.com 1 application_1424279690678_0078 1> /mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004/stdout 2> /mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004/stderr
|- 10513 6982 6976 6976 (python2.7) 9308 1224 448360448 13857 /usr/local/bin/python2.7 /mnt/my_script.py my_args
|- 6982 6976 6976 6976 (java) 115176 12032 8632229888 1951974 /usr/java/jdk1.7.0_71/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms7048m -Xmx7048m -Djava.io.tmpdir=/mnt/data1/hadoop/yarn/local/usercache/root/appcache/application_1424279690678_0078/container_1424279690678_0078_01_000004/tmp -Dspark.driver.port=33589 -Dspark.ui.port=0 -Dspark.yarn.app.container.log.dir=/mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_000004 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkdri...@slave3.lambdoop.com:33589/user/CoarseGrainedScheduler 5 slave1.lambdoop.com 1 application_1424279690678_0078
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

I find this strange because the Python script processes each input line separately and makes a simple, independent calculation per line: it basically parses the line, calculates the Haversine distance, and returns a double value. Input lines are traversed in Python with a "for line in sys.stdin" loop (I've included a simplified sketch of the script at the end of this message).

Also, to avoid memory leaks in Python:
- I call sys.stdout.flush() after each output line generated by Python.
- I call the following function after writing each output line, to force garbage collection regularly in Python:

import gc

_iterations_until_gc = 1000
iterations_since_gc = 0

def update_garbage_collector():
    global iterations_since_gc
    if iterations_since_gc >= _iterations_until_gc:
        gc.collect()
        iterations_since_gc = 0
    else:
        iterations_since_gc += 1

So the memory consumption of the script should be constant, but in practice it looks like there is some memory leak. Maybe Spark introduces a leak when redirecting the IO in pipe()? Has any of you experienced similar situations when using pipe() in Spark?

Also, do you know how I could control the amount of memory reserved for the subprocess created by pipe()? I understand that with --executor-memory I set the memory for the Spark executor process, but not for the subprocess created by pipe().

Thanks in advance for your help.

Greetings,
Juan
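P.S. In case it helps, this is roughly what the Python script does. It is only a simplified sketch, not the exact code: the parsing and the haversine() helper shown here are illustrative (I just assume four comma-separated coordinates per line), and update_garbage_collector() is the same function quoted above.

import sys
import gc
from math import radians, sin, cos, asin, sqrt

_iterations_until_gc = 1000
iterations_since_gc = 0

def update_garbage_collector():
    # Force a gc.collect() every _iterations_until_gc output lines.
    global iterations_since_gc
    if iterations_since_gc >= _iterations_until_gc:
        gc.collect()
        iterations_since_gc = 0
    else:
        iterations_since_gc += 1

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Each input line is processed independently; nothing is accumulated
# across lines, which is why I expect memory usage to stay flat.
for line in sys.stdin:
    lat1, lon1, lat2, lon2 = map(float, line.strip().split(','))
    sys.stdout.write('%f\n' % haversine(lat1, lon1, lat2, lon2))
    sys.stdout.flush()
    update_garbage_collector()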