Re: Memory problems when calling pipe()

2015-02-24 Thread Juan Rodríguez Hortalá
Hi,

I finally solved the problem by setting spark.yarn.executor.memoryOverhead,
passing --conf spark.yarn.executor.memoryOverhead= to spark-submit, as
pointed out in
http://stackoverflow.com/questions/28404714/yarn-why-doesnt-task-go-out-of-heap-space-but-container-gets-killed
and https://issues.apache.org/jira/browse/SPARK-2444, and now it works ok.
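
For anyone who hits this later, the invocation looks roughly like this (a
sketch: the 1024 MB overhead and my-app.jar are made-up examples, not my
actual values):

    spark-submit \
      --master yarn-cluster \
      --executor-memory 7g \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      my-app.jar

As I understand it, YARN sizes each executor container as the executor
memory plus this overhead, so the overhead is effectively the headroom
available to the piped subprocess.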

Greetings,

Juan

2015-02-23 10:40 GMT+01:00 Juan Rodríguez Hortalá 
juan.rodriguez.hort...@gmail.com:

  Hi,

 I'm having problems using pipe() from a Spark program written in Java
 that calls a Python script, running on a YARN cluster. The problem is
 that the job fails when YARN kills the container because the Python
 script goes beyond the memory limits. I get something like this in the log:

 01_04. Exit status: 143. Diagnostics: Container
 [pid=6976,containerID=container_1424279690678_0078_01_04] is running
 beyond physical memory limits. Current usage: 7.5 GB of 7.5 GB physical
 memory used; 8.6 GB of 23.3 GB virtual memory used. Killing container.
 Dump of the process-tree for container_1424279690678_0078_01_04 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 6976 1457 6976 6976 (bash) 0 0 108613632 338 /bin/bash -c
 /usr/java/jdk1.7.0_71/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms7048m -Xmx7048m
 -Djava.io.tmpdir=/mnt/data1/hadoop/yarn/local/usercache/root/appcache/application_1424279690678_0078/container_1424279690678_0078_01_04/tmp
 '-Dspark.driver.port=33589' '-Dspark.ui.port=0'
 -Dspark.yarn.app.container.log.dir=/mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_04
 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://
 sparkdri...@slave3.lambdoop.com:33589/user/CoarseGrainedScheduler 5
 slave1.lambdoop.com 1 application_1424279690678_0078 1
 /mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_04/stdout
 2
 /mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_04/stderr

 |- 10513 6982 6976 6976 (python2.7) 9308 1224 448360448 13857
 /usr/local/bin/python2.7 /mnt/my_script.py my_args
 |- 6982 6976 6976 6976 (java) 115176 12032 8632229888 1951974
 /usr/java/jdk1.7.0_71/bin/java -server -XX:OnOutOfMemoryError=kill %p
 -Xms7048m -Xmx7048m
 -Djava.io.tmpdir=/mnt/data1/hadoop/yarn/local/usercache/root/appcache/application_1424279690678_0078/container_1424279690678_0078_01_04/tmp
 -Dspark.driver.port=33589 -Dspark.ui.port=0
 -Dspark.yarn.app.container.log.dir=/mnt/data1/hadoop/yarn/log/application_1424279690678_0078/container_1424279690678_0078_01_04
 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://
 sparkdri...@slave3.lambdoop.com:33589/user/CoarseGrainedScheduler 5
 slave1.lambdoop.com 1 application_1424279690678_0078

 Container killed on request. Exit code is 143
 Container exited with a non-zero exit code 143
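
 For reference, the pipe() call in the Java program looks roughly like
 this (a minimal sketch: the script path and arguments are the ones shown
 in the process dump above, and inputLines is just a placeholder name for
 the RDD being piped):

 import org.apache.spark.api.java.JavaRDD;

 // inputLines: an existing JavaRDD<String>; each element of the RDD is
 // written to the script's stdin, and each line the script prints to
 // stdout becomes an element of the result
 JavaRDD<String> distances =
     inputLines.pipe("/usr/local/bin/python2.7 /mnt/my_script.py my_args");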

 I find this strange because the Python script processes each input line
 separately and makes a simple, independent calculation per line: it
 basically parses the line, computes the Haversine distance, and returns a
 double value. Input lines are traversed in Python with a "for line in
 sys.stdin" loop. Also, to avoid memory leaks in Python:
 - I call sys.stdout.flush() after each output line generated by Python.
 - I call the following function after writing each output line, to force
 garbage collection regularly in Python:

 import gc

 _iterations_until_gc = 1000
 iterations_since_gc = 0

 def update_garbage_collector():
     # force a collection every _iterations_until_gc calls
     global iterations_since_gc
     if iterations_since_gc >= _iterations_until_gc:
         gc.collect()
         iterations_since_gc = 0
     else:
         iterations_since_gc += 1
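
 To make the script's shape concrete, the processing loop looks roughly
 like this (a sketch: the input format, assumed here to be four
 comma-separated coordinates per line, is not the real one):

 import sys
 from math import radians, sin, cos, asin, sqrt

 def haversine(lat1, lon1, lat2, lon2):
     # great-circle distance in kilometers between two (lat, lon) points
     lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
     a = (sin((lat2 - lat1) / 2) ** 2
          + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
     return 2 * 6371.0 * asin(sqrt(a))

 for line in sys.stdin:
     # hypothetical format: "lat1,lon1,lat2,lon2"
     lat1, lon1, lat2, lon2 = map(float, line.split(','))
     sys.stdout.write('%f\n' % haversine(lat1, lon1, lat2, lon2))
     sys.stdout.flush()
     update_garbage_collector()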

 So the memory consumption of the script should be constant, but in
 practice it looks like there is some memory leak. Maybe Spark is
 introducing a memory leak when redirecting the IO in pipe()? Have any of
 you experienced similar situations when using pipe() in Spark? Also, do
 you know how I could control the amount of memory reserved for the
 subprocess created by pipe()? I understand that with --executor-memory I
 set the memory for the Spark executor process, but not for the subprocess
 created by pipe().
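
 (Doing the arithmetic on the dump above: the container limit is 7.5 GB =
 7680 MB, while the executor JVM was started with -Xmx7048m, so only about
 7680 - 7048 = 632 MB of the container budget was left for everything
 outside the JVM heap, including the python2.7 subprocess, if YARN sizes
 the container as the executor memory plus the memory overhead.)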

 Thanks in advance for your help.

 Greetings,

 Juan



