Re: Large number of pyspark.daemon processes
After slimming down the job quite a bit, it looks like a call to coalesce() on a larger RDD can cause these Python worker spikes (additional details in the JIRA: https://issues.apache.org/jira/browse/SPARK-5395?focusedCommentId=14294570&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14294570). Any ideas as to what coalesce() is doing that triggers the creation of additional workers?
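A minimal PySpark sketch of the pattern described above might look like the following. This is purely hypothetical and not the job from this thread; the app name, record count, and partition counts are assumptions chosen only to exercise a Python map followed by coalesce().

# Hypothetical repro sketch (not the actual job): run a Python transformation,
# then coalesce() to far fewer partitions, and watch pyspark.daemon process
# counts per container while the stage runs.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("coalesce-worker-spike-repro")  # assumed app name
sc = SparkContext(conf=conf)

# Sizes below are illustrative assumptions, not taken from the thread.
rdd = sc.parallelize(range(1000000), 2000)      # many input partitions
mapped = rdd.map(lambda x: (x % 1024, x))       # forces Python workers to run
small = mapped.coalesce(64)                     # the step reported to spike workers
print(small.count())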
Re: Large number of pyspark.daemon processes
Hey Davies,

Sure thing, it's filed here now: https://issues.apache.org/jira/browse/SPARK-5395

As far as a repro goes, what is a normal number of workers I should expect? Even shortly after kicking the job off, I see workers in the double digits per container. Here's an example using pstree on a worker node (the worker node runs 16 containers, i.e. the java───bash branches below). The initial Python process per container is forked between 7 and 33 times.

├─java─┬─bash───java─┬─python───22*[python]
│      │             └─89*[{java}]
│      ├─bash───java─┬─python───13*[python]
│      │             └─80*[{java}]
│      ├─bash───java─┬─python───9*[python]
│      │             └─78*[{java}]
│      ├─bash───java─┬─python───24*[python]
│      │             └─91*[{java}]
│      ├─3*[bash───java─┬─python───19*[python]]
│      │                └─86*[{java}]]
│      ├─bash───java─┬─python───25*[python]
│      │             └─90*[{java}]
│      ├─bash───java─┬─python───33*[python]
│      │             └─98*[{java}]
│      ├─bash───java─┬─python───7*[python]
│      │             └─75*[{java}]
│      ├─bash───java─┬─python───15*[python]
│      │             └─80*[{java}]
│      ├─bash───java─┬─python───18*[python]
│      │             └─84*[{java}]
│      ├─bash───java─┬─python───10*[python]
│      │             └─85*[{java}]
│      ├─bash───java─┬─python───26*[python]
│      │             └─90*[{java}]
│      ├─bash───java─┬─python───17*[python]
│      │             └─85*[{java}]
│      ├─bash───java─┬─python───24*[python]
│      │             └─92*[{java}]
│      └─265*[{java}]

Each container has 2 cores in case that makes a difference. Thank you!
-Sven
Large number of pyspark.daemon processes
Hey all,

I am running into a problem where YARN kills containers for being over their memory allocation (which is about 8G for executors plus 6G for overhead), and I noticed that in those containers there are tons of pyspark.daemon processes hogging memory. Here's a snippet from a container with 97 pyspark.daemon processes. The total RSS usage across all of these is 1,764,956 pages (i.e. 6.7GB on the system). Any ideas what's happening here and how I can get the number of pyspark.daemon processes back to a more reasonable count?

2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1421692415636_0052_01_30 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon
|- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon
|- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon
[...]

Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c

Thank you!
-Sven
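For context, the memory setup described above can be sketched as application configuration like the one below. Only the rough sizes (about 8G executor heap plus about 6G overhead for a roughly 14.5G container) come from the thread; the app name and exact values are assumptions, and in practice these are usually passed via spark-submit or spark-defaults.conf rather than in code.

# Sketch of the memory configuration described in the message above (assumed values).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-config-sketch")                   # hypothetical app name
        .set("spark.executor.memory", "8g")                   # executor JVM heap
        .set("spark.yarn.executor.memoryOverhead", "6144"))   # off-heap headroom in MB (Spark 1.x setting)
sc = SparkContext(conf=conf)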
Re: Large number of pyspark.daemon processes
Hi Sven,

What version of Spark are you running? Recent versions have a change that allows PySpark to share a pool of processes instead of starting a new one for each task.

-Sandy
Re: Large number of pyspark.daemon processes
YARN only has the ability to kill a container; it cannot checkpoint it or suspend it via signal. If you use too much memory, it will simply kill tasks based on the YARN config. https://issues.apache.org/jira/browse/YARN-2172
Re: Large number of pyspark.daemon processes
Hey Adam,

I'm not sure I understand just yet what you have in mind. My takeaway from the logs is that the container actually was above its allotment of about 14G. Since 6G of that is for overhead, I assumed there would be plenty of space for Python workers, but there seem to be more of those than I'd expect. Does anyone know if that is actually the intended behavior, i.e. in this case over 90 Python processes on a 2-core executor?

Best,
-Sven
Re: Large number of pyspark.daemon processes
This looks like a bug: the Python worker did not exit normally. Could you file a JIRA for it? Also, could you show how to reproduce this behavior?
Re: Large number of pyspark.daemon processes
Hey Sandy,

I'm using Spark 1.2.0. I assume you're referring to worker reuse? In this case I've already set spark.python.worker.reuse to false (but I also saw this behavior when keeping it enabled).

Best,
-Sven
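For reference, the worker-reuse setting mentioned above can also be toggled programmatically; a minimal sketch follows. Only spark.python.worker.reuse is taken from the thread; the app name is a placeholder.

# Sketch of disabling PySpark worker reuse (the setting discussed above).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("worker-reuse-test")                 # hypothetical app name
        .set("spark.python.worker.reuse", "false"))      # defaults to "true" in Spark 1.2
sc = SparkContext(conf=conf)
# With reuse disabled, each Python worker exits after its task instead of being
# kept around in the daemon's pool for later tasks.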