You said your HDFS cluster and your Spark cluster run on different clusters. This is not a good idea, because you lose data locality: tasks can no longer be scheduled on the nodes that hold the HDFS blocks they read. Your Spark nodes also need the HDFS client configuration (core-site.xml and hdfs-site.xml) so they can reach the remote NameNode.

A Spark job is composed of stages, and each stage has one or more partitions; the parallelism of the job is determined by the number of partitions. Whether a shuffle happens is determined by the operators you use, such as reduceByKey, repartition, sortBy, and so on.
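As a minimal sketch, one way to point the Spark side at the remote HDFS is through spark.hadoop.* properties (the NameNode address "namenode-host:8020" below is just a placeholder for your cluster's address); alternatively, set HADOOP_CONF_DIR on the Spark nodes to a directory containing the remote cluster's core-site.xml and hdfs-site.xml:

  import org.apache.spark.sql.SparkSession

  // Point Spark's built-in Hadoop client at the remote HDFS NameNode.
  // "namenode-host:8020" is a placeholder; substitute your own address.
  val spark = SparkSession.builder()
    .appName("remote-hdfs-read")
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode-host:8020")
    .getOrCreate()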
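Continuing that sketch, you can see both points in action: the partition count of the input RDD sets the parallelism of the first stage, and a wide operator like reduceByKey introduces a shuffle and a new stage boundary (the input path is again a placeholder):

  // Each HDFS block becomes roughly one partition; the partition
  // count drives the parallelism of this stage.
  val lines = spark.sparkContext
    .textFile("hdfs://namenode-host:8020/data/input.txt")
  println(s"partitions: ${lines.getNumPartitions}")

  // reduceByKey repartitions data by key, so Spark inserts a
  // shuffle here and starts a new stage.
  val counts = lines
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.take(10).foreach(println)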