Hello,

I have a Spark job that reads data from two tables into two DataFrames, which are then converted to RDDs. I then join them on a common key. Each table is about 10 TB in size, but after filtering, the two RDDs are about 500 GB each. I have 800 executors with 8 GB of memory per executor.

Everything works fine until the join stage, which fails with the error below. I tried increasing the number of partitions before the join stage, but it made no difference.

Any ideas on how I can fix this and what I might be doing wrong?

Thanks,
Vinay

ExecutorLostFailure (executor 208 exited caused by one of the running tasks) Reason: Container marked as failed: container_1469773002212_96618_01_000246 on host:. Exit status: 143. Diagnostics: Container [pid=31872,containerID=container_1469773002212_96618_01_000246] is running beyond physical memory limits. Current usage: 15.2 GB of 15.1 GB physical memory used; 15.9 GB of 31.8 GB virtual memory used. Killing container.
Dump of the process-tree for container_1469773002212_96618_01_000246 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 31883 31872 31872 31872 (java) 519517 41888 17040175104 3987193 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms14336m -Xmx14336m -Djava.io.tmpdir=/hadoop/11/scratch/local/usercacheappcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/tmp -Dspark.driver.port=32988 -Dspark.ui.port=0 -Dspark.akka.frameSize=256 -Dspark.yarn.app.container.log.dir=/hadoop/12/scratch/logs/application_1469773002212_96618/container_1469773002212_96618_01_000246 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.12.7.4:32988 --executor-id 208 -hostname x.com --cores 11 --app-id application_1469773002212_96618 --user-class-path file:/hadoop/11/scratch/local/usercache /appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/__app__.jar --user-class-path file:/hadoop/11/scratch/local/usercache/ appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/mysql-connector-java-5.0.8-bin.jar --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-core-3.2.10.jar --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-api-jdo-3.2.6.jar --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-rdbms-3.2.9.jar
|- 31872 16580 31872 31872 (bash) 0 0 9146368 267 /bin/bash -c LD_LIBRARY_PATH=/apache/hadoop/lib/native:/apache/hadoop/lib/native/Linux-amd64-64: /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms14336m -Xmx14336m -Djava.io.tmpdir=/hadoop/11/scratch/local/usercache/ appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/tmp '-Dspark.driver.port=32988' '-Dspark.ui.port=0' '-Dspark.akka.frameSize=256' -Dspark.yarn.app.container.log.dir=/hadoop/12/scratch/logs/application_1469773002212_96618/container_1469773002212_96618_01_000246 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@1.4.1.6:32988 --executor-id 208 --hostname x.com --cores 11 --app-id application_1469773002212_96618 --user-class-path file:/hadoop/11/scratch/local/usercache/ appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/__app__.jar --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/mysql-connector-java-5.0.8-bin.jar --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-core-3.2.10.jar --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-api-jdo-3.2.6.jar --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-rdbms-3.2.9.jar 1> /hadoop/12/scratch/logs/application_1469773002212_96618/container_1469773002212_96618_01_000246/stdout 2> /hadoop/12/scratch/logs/application_1469773002212_96618/container_1469773002212_96618_01_000246/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
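
For reference, here is a rough breakdown of the memory figures in the log above, as a minimal sketch rather than a diagnosis. The heap size, resident page count, and container limit are copied from the log. The 4 KiB page size and the reading of YARN's "GB" as binary gigabytes are assumptions, since the log states neither.

```python
# Figures copied from the container diagnostics in the log.
HEAP_MB = 14336             # from -Xms14336m / -Xmx14336m
RSS_PAGES = 3987193         # RSSMEM_USAGE(PAGES) of the java process (PID 31883)
CONTAINER_LIMIT_GB = 15.1   # "15.2 GB of 15.1 GB physical memory used"

# Assumptions, not stated in the log:
PAGE_SIZE_BYTES = 4096      # typical x86-64 Linux page size
GIB = 1024 ** 3             # treat the "GB" in the YARN message as GiB

# Resident memory of the executor JVM, converted from pages to GiB.
rss_gib = RSS_PAGES * PAGE_SIZE_BYTES / GIB

# Maximum Java heap in GiB.
heap_gib = HEAP_MB / 1024

# Lower bound on memory held outside the Java heap: even if the whole
# heap were resident, the rest of the RSS must be non-heap memory.
min_non_heap_gib = rss_gib - heap_gib

# Room the container leaves on top of the maximum heap.
headroom_gib = CONTAINER_LIMIT_GB - heap_gib

print(f"resident memory     : {rss_gib:.2f} GiB")
print(f"max Java heap       : {heap_gib:.2f} GiB")
print(f"non-heap (at least) : {min_non_heap_gib:.2f} GiB")
print(f"container headroom  : {headroom_gib:.2f} GiB")
```

Under those assumptions the resident size works out to about 15.2 GiB, which matches the "15.2 GB" usage reported by YARN, and the memory outside the Java heap (at least about 1.2 GiB) is larger than the roughly 1.1 GiB the container allows on top of the heap.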