Question about executor memory setting
Hi all,

May I ask a question about the executor memory setting? I was running PageRank with a 2.8GB input on one workstation for testing, and I gave PageRank a single executor.

In case 1, I set --executor-cores to 4 and --executor-memory to 1GB; the stage (stage 2) completion time was 14 min (detailed stage info below).

In case 2, I set --executor-cores to 4 and --executor-memory to 6GB; the stage (stage 2) completion time was 34 min (detailed stage info below).

I am totally confused: why is the stage more than two times slower when executor-memory gets larger? From the web UI, I found that with 6GB of executor memory the shuffle spill (Disk) per task is smaller, which should mean fewer I/O operations, yet the task completion time is longer. Could anyone give me some hints? Great thanks!
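Roughly, the two runs were submitted like this (the class name, jar, and input path are placeholders, not my exact command):

    # Case 1: 4 cores, 1GB of executor memory
    spark-submit --class PageRank --executor-cores 4 --executor-memory 1g pagerank.jar input
    # Case 2: same cores, 6GB of executor memory
    spark-submit --class PageRank --executor-cores 4 --executor-memory 6g pagerank.jar input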
How to decide the number of tasks in Spark?
Hi,

When launching a job in Spark, I have great trouble deciding the number of tasks. Some say it is better to create one task per HDFS block, i.e., make sure each task processes 128MB of input data; others suggest that the number of tasks should be twice the total number of cores available to the job. I have also seen the suggestion to launch small tasks in Spark, i.e., make sure each task lasts around 100ms. I am quite confused by all these suggestions. Is there any general rule for deciding the number of tasks in Spark? Great thanks!

Best
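To make the question concrete, this is roughly how I have been controlling the task count so far (paths and numbers are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.default.parallelism controls the number of tasks for shuffle
    // operations (reduceByKey, join, ...) when no count is given explicitly.
    val conf = new SparkConf()
      .setAppName("TaskCountExample")
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)

    // textFile's second argument is the minimum number of input partitions,
    // i.e. the number of map tasks reading the file.
    val lines = sc.textFile("hdfs:///data/input.txt", 8)

    // repartition() changes the task count for downstream stages.
    val repartitioned = lines.repartition(16)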
Question about Spark shuffle read size
Hi all,

When I run WordCount on Spark, I find that when I set "spark.default.parallelism" to different values, the Shuffle Write size and Shuffle Read size change as well (I read these numbers from the history server's web UI). Is it because the shuffle write size also includes some metadata?

Also, my input file for WordCount is approximately 3kB (stored in the local filesystem), and I partitioned it into 10 pieces using the textFile function. However, the web UI shows that WordCount's input size is 19.5kB, much larger than the actual input. Why would that happen? Great thanks!
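For reference, my job is essentially the following sketch (the file paths are placeholders; only the parallelism value and the partition count of 10 match what I described above):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.default.parallelism", "10") // the value I have been varying
    val sc = new SparkContext(conf)

    // ~3kB local file, split into 10 partitions via textFile's second argument
    val counts = sc.textFile("file:///tmp/input.txt", 10)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("file:///tmp/wordcount-output")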
Re: How to compile Spark with customized Hadoop?
Hi,

When I publish my version of Hadoop, it is installed in: /HOME_DIRECTORY/.m2/repository/org/apache/hadoop, but when I compile Spark, it will fetch Hadoop libraries from https://repo1.maven.org/maven2/org/apache/hadoop. How can I let Spark fetch Hadoop libraries from my local M2 cache? Great thanks!

On Fri, Oct 9, 2015 at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> You can publish your version of Hadoop to your Maven cache with mvn
> publish (just give it a different version number, e.g. 2.7.0a) and then
> pass that as the Hadoop version to Spark's build (see
> http://spark.apache.org/docs/latest/building-spark.html).
>
> Matei
>
> On Oct 9, 2015, at 3:10 PM, Dogtail L <spark.ru...@gmail.com> wrote:
>
> Hi all,
>
> I have modified Hadoop source code, and I want to compile Spark with my
> modified Hadoop. Do you know how to do that? Great thanks!
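For anyone finding this thread later, a sketch of what I understand the suggestion to be (the custom version number 2.7.0a is just the example from Matei's reply, and the Hadoop POMs must already be set to that version):

    # In the modified Hadoop source tree, install the artifacts into the
    # local ~/.m2 repository under the custom version (e.g. 2.7.0a):
    mvn install -DskipTests

    # Then build Spark against that version, so Maven resolves it from the
    # local repository instead of repo1.maven.org:
    ./build/mvn -Dhadoop.version=2.7.0a -DskipTests clean package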
How to compile Spark with customized Hadoop?
Hi all, I have modified Hadoop source code, and I want to compile Spark with my modified Hadoop. Do you know how to do that? Great thanks!