Also, I found that the 'daemon.py' processes will keep running on a worker node even after I have terminated the Spark job at the master node. That seems a little strange to me.
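For reference, one way to confirm and clean up such leftover workers on a node is with standard process tools; this is nothing Spark-specific, and the pattern below simply assumes the stock pyspark/daemon.py path mentioned earlier in the thread:

    # On the affected worker node: list PySpark worker processes that
    # outlived the job, then terminate them if their job is gone.
    ps aux | grep '[p]yspark/daemon.py'   # the [p] keeps grep out of its own results
    pkill -f 'pyspark/daemon.py'          # send SIGTERM to the leftover workers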
2013/10/8 Shangyu Luo <[email protected]>

> Hello Jey,
> Thank you for answering. I have found that there are about 6 or 7
> 'daemon.py' processes on one worker node. Will each core have its own
> 'daemon.py' process? How is the number of 'daemon.py' processes per
> worker node decided? I have also found that there are many Spark-related
> Java processes on a worker node, so if the Java process on a worker node
> is only responsible for communication, why does Spark need so many Java
> processes?
> Overall, I think the main problem with my program is memory allocation.
> More specifically, in spark-env.sh there are two options,
> SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS. I can also set
> spark.executor.memory in SPARK_JAVA_OPTS. So if I have 68g of memory on
> a worker node, how should I distribute memory among these options? At
> present I use the default values for SPARK_DAEMON_MEMORY and
> SPARK_DAEMON_JAVA_OPTS and set spark.executor.memory to 20g. It seems
> that Spark counts RDD storage against spark.executor.memory, and I find
> that each 'daemon.py' also consumes about 7g of memory. After running my
> program for a while, it uses up all the memory on a worker node and the
> master node reports connection errors. (I have 5 worker nodes, each with
> 8 cores.) So I am a little confused about what the three options are
> each responsible for and how to distribute memory among them.
> Any suggestion will be appreciated.
> Thanks!
>
> Best,
> Shangyu
>
>
> 2013/10/8 Jey Kottalam <[email protected]>
>
>> Hi Shangyu,
>>
>> The daemon.py python process is the actual PySpark worker process, and
>> is launched by the Spark worker when running Python jobs. So, when
>> using PySpark, the "real computation" is handled by a python process
>> (via daemon.py), not a java process.
>>
>> Hope that helps,
>> -Jey
>>
>> On Mon, Oct 7, 2013 at 9:50 PM, Shangyu Luo <[email protected]> wrote:
>> > Hello!
>> > I am using Spark 0.7.3 with the Python API. Recently, when I ran some
>> > Spark programs on a cluster, I found that some processes invoked by
>> > spark-0.7.3/python/pyspark/daemon.py would hold the CPU for a long
>> > time and consume a lot of memory (e.g., 5g for each process). It
>> > seemed that the Java process, which was invoked by
>> > java -cp :/usr/lib/spark-0.7.3/conf:/usr/lib/spark-0.7.3/core/target/scala-2.9.3/classes ...
>> > was 'competing' with daemon.py for CPU resources. From my
>> > understanding, the Java process should be responsible for the 'real'
>> > computation in Spark.
>> > So I am wondering what work daemon.py actually does. Is it normal for
>> > it to consume a lot of CPU and memory?
>> > Thanks!
>> >
>> > Best,
>> > Shangyu Luo

--
Shangyu, Luo
Department of Computer Science
Rice University
--
Not Just Think About It, But Do It!
--
Success is never final.
--
Losers always whine about their best
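For context on the memory question in the quoted message, here is a rough spark-env.sh budgeting sketch that uses only the figures mentioned in the thread (68g per worker node, 8 cores, roughly 7g observed per daemon.py process, and spark.executor.memory passed through SPARK_JAVA_OPTS as in Spark 0.7.x). The concrete numbers are illustrative assumptions, not recommendations:

    # spark-env.sh (sketch; the values are assumptions taken from this thread)

    # Heap for the standalone master/worker daemon JVMs themselves; these do
    # cluster bookkeeping rather than computation, so a modest heap suffices.
    export SPARK_DAEMON_MEMORY=512m

    # Extra JVM options for those same daemon JVMs (GC flags, logging, etc.),
    # not for executors.
    export SPARK_DAEMON_JAVA_OPTS=""

    # Executor heap per application, passed as a Java system property.
    # Note the arithmetic on a 68g node: 20g of executor heap plus 8 cores
    # times ~7g of daemon.py memory is about 76g, which already overcommits
    # the node, so a smaller executor heap (or fewer Python workers per node)
    # is needed to leave headroom for the OS.
    export SPARK_JAVA_OPTS="-Dspark.executor.memory=12g"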
