Hi All,

I need some help with a problem in pyspark which is causing a major issue. 

Recently I've noticed that the behaviour of the python.deamons on the worker
nodes for compute-intensive tasks have changed from using all the avaliable
cores to using only a single core. On each worker node, 8 python.deamons
exist but they all seem to run on a single core. The remaining 7 cores idle.

Our hardware consists of 9 hosts (1x driver node and 8x worker nodes) each
with 8 cores and 64gb RAM - we are using Spark 1.2.0-SNAPSHOT (Python) in
standalone mode via Cloudera 5.3.2. 

To give a better understanding of the problem I have made a quick script
from one of the given examples which replicates the problem:

When I run the calculating_pies.py script using the command "spark-submit
calculate_pies.py" this is what I typically see on all my worker nodes:

  1  [|||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]      5 
[|||                                                  3.9%]
  2  [                                                                     
0.0%]      6  [|||                                                  2.6%]
  3  [||                                                                   
1.3%]     7  [|||||                                                8.4%]
  4  [||                                                                   
1.3%]     8  [||||||                                               8.7%]

  PID USER      PRI  NI  VIRT   RES    SHR S CPU% MEM%   TIME+  Command
30672 spark    20   0   225M  112M  1156 R 13.0   0.2       0:03.10 python
-m pyspark.daemon
30681 spark    20   0   225M  112M  1152 R 13.0   0.2       0:03.10 python
-m pyspark.daemon
30687 spark    20   0   225M  112M  1152 R 13.0   0.2       0:03.10 python
-m pyspark.daemon
30678 spark    20   0   225M  112M  1152 R 12.0   0.2       0:03.10 python
-m pyspark.daemon
30693 spark    20   0   225M  112M  1152 R 12.0   0.2       0:03.08 python
-m pyspark.daemon
30674 spark    20   0   225M  112M  1152 R 12.0   0.2       0:03.10 python
-m pyspark.daemon
30688 spark    20   0   225M  112M  1152 R 12.0   0.2       0:03.08 python
-m pyspark.daemon
30684 spark    20   0   225M  112M  1152 R 12.0   0.2       0:03.10 python
-m pyspark.daemon

Through the spark UI I do see 8 executor ids with 8 active tasks on each. I
also see the same behaviour if I use the flag --total-executor-cores 64 in
spark-submit.

Strangly, If I run the same script in local mode everything seems to run
fine. This is what I see

  1  [||||||||||||||||||||||||||||||||||||||||||||||100.0%]     5 
[||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  2  [||||||||||||||||||||||||||||||||||||||||||||||100.0%]     6 
[||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  3  [||||||||||||||||||||||||||||||||||||||||||||||100.0%]     7 
[||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  4  [|||||||||||||||||||||||||||||||||||||||||||||||99.3%]     8 
[|||||||||||||||||||||||||||||||||||||||||||||||99.4%]

  PID USER      PRI  NI  VIRT   RES    SHR  S CPU% MEM%   TIME+  Command
22519 data       20   0  225M  106M  1368 R 99.0   0.2        0:10.97 python
-m pyspark.daemon
22508 data       20   0  225M  106M  1368 R 99.0   0.2        0:10.92 python
-m pyspark.daemon
22513 data       20   0  225M  106M  1368 R 99.0   0.2        0:11.02 python
-m pyspark.daemon
22526 data       20   0  225M  106M  1368 R 99.0   0.2        0:10.84 python
-m pyspark.daemon
22522 data       20   0  225M  106M  1368 R 98.0   0.2        0:10.95 python
-m pyspark.daemon
22523 data       20   0  225M  106M  1368 R 97.0   0.2        0:10.92 python
-m pyspark.daemon
22507 data       20   0  225M  106M  1368 R 97.0   0.2        0:10.83 python
-m pyspark.daemon
22516 data       20   0  225M  106M  1368 R 93.0   0.2        0:10.88 python
-m pyspark.daemon


============== calculating_pies.py ==================

#!/usr/bin/pyspark

import random
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.storagelevel import StorageLevel

def pi(NUM_SAMPLE = 1000000):
    count = 0.
    for i in xrange(NUM_SAMPLE):
        x, y = random.random(), random.random()
        if (x * x) + (y * y) < 1:
            count += 1
    return 4.0 * (count / NUM_SAMPLE)

if __name__ == "__main__":

    sconf = (SparkConf()
             .set('spark.default.parallelism','256')
             .set('spark.app.name', 'Calculating PI'))

    # local
    # sc = SparkContext(conf=sconf)

    # standalone
    sc = SparkContext("spark://<driver_host>:7077", conf=sconf)

    # yarn
    # sc = SparkContext("yarn-client", conf=sconf)

    rdd_pies = sc.parallelize(range(10000), 1000)
    rdd_pies.map(lambda x: pi()).collect()
    sc.stop()

=====================================================

Does anyone have any suggestions, or know of any config we should be looking
at that could solve this problem? Does anyone else see the same problem?

Any help is appreciated, Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-not-using-all-cores-tp21989.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Reply via email to